TOOL
Luce DFlash
Overview
Luce DFlash is an MIT-licensed GGUF implementation of DFlash speculative decoding for Qwen3.6-27B. Released in April 2026, it achieves approximately 2x throughput on consumer hardware (RTX 3090) with KV cache compression, sliding-window flash attention, and an OpenAI-compatible HTTP endpoint.
Timeline
- 2026-04-28-AI-Digest — Luce DFlash released on r/LocalLLaMA as a complete, MIT-licensed speculative-decoding implementation with ~1.98x measured mean speedup, KV cache compression (TQ3_0), sliding-window flash attention, and OpenAI-compatible endpoint.
- 2026-05-09-AI-Digest — DFlash extends from a Qwen-only drafter into a multi-vendor pattern: z-lab publishes both Gemma 4 26B-A4B and Qwen3.6-27B DFlash drafters with a stateful-KV-cache claim distinguishing DFlash from MTP drafters. The Luce DFlash arc and z-lab’s drafters together establish DFlash as a multi-vendor drafter pattern across Qwen and Gemma rather than a single-implementation novelty.
Key Developments
-
Practical Speculative Decoding for Consumer Hardware: Luce DFlash brings speculative-decoding performance gains (typically a research paper result) to the consumer-hardware envelope, enabling hobbyists to run Qwen3.6-27B at usable throughput on a single 24 GB GPU.
-
Complete Open-Source Tooling: The release includes turnkey GGUF quantization, KV cache optimization, and a ready-to-use HTTP server, reducing the friction between “research results” and “deployable inference workflow.”