TOOL
ExLlamaV3
developer-tool, topic-note
Overview
ExLlamaV3 is an inference engine for quantized language models on consumer GPUs, developed by turboderp. It is a primary runtime in the local-inference community for running models on a single workstation GPU.
Timeline
- 2026-05-12-AI-Digest: turboderp shipped a rapid sequence of ExLlamaV3 releases (r/LocalLLaMA, 145 points) adding Gemma 4 support, improved cache efficiency, and DFlash. The dev branch continues to accumulate commits at a high cadence. Because ExLlamaV3 is a common runtime on single workstation GPUs, its throughput and model-compatibility changes reach local-model users directly.
Key Developments
- Gemma 4 Support + DFlash: The May 2026 ExLlamaV3 releases add Gemma 4 support, improved cache efficiency, and DFlash. The rapid release cadence continues to make ExLlamaV3 a primary route by which new model support and inference optimizations reach consumer-GPU users.