TOOL

ExLlamaV3

developer-tool, topic-note

Overview

ExLlamaV3 is an inference engine for quantized language models on consumer GPUs, developed by turboderp. It is a primary runtime for the local-inference community running models on single workstation GPUs.

Timeline

  • 2026-05-12-AI-Digest: turboderp shipped a rapid sequence of ExLlamaV3 releases (r/LocalLLaMA, 145 points) adding Gemma 4 support, improved cache efficiency, and DFlash. The dev branch continues to accumulate commits at a high cadence. Throughput and model-compatibility changes in ExLlamaV3 directly affect anyone running local models on a single workstation GPU.

Key Developments

  1. Gemma 4 Support + DFlash: The May 2026 ExLlamaV3 release cluster adds Gemma 4 support, improved cache efficiency, and DFlash. The rapid release cadence makes ExLlamaV3 a primary channel through which new model support and inference optimizations reach consumer-GPU users.