COMPANY
z-lab
Overview
z-lab is the open-weights drafter publisher behind the DFlash speculative-decoding pattern that emerged on r/LocalLLaMA across April–May 2026. The lab ships DFlash drafters paired against open-weights main models — initially Gemma 4 26B-A4B, then Qwen3.6-27B — claiming a stateful drafter (KV-cache positions and RoPE offsets persist across iterations) where MTP drafters are not. z-lab is now the multi-vendor face of DFlash as a drafter pattern rather than a single-implementation novelty.
Timeline
- 2026-05-09-AI-Digest: z-lab’s gemma-4-26B-A4B-it-DFlash drafter benchmarked at ~600 tok/s on a single RTX 5090 against vLLM 0.19.2rc1 with num_speculative_tokens=8, up from a ~228 tok/s baseline on the cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit main + z-lab DFlash draft pair (256-input / 1024-output random workload). A separate r/LocalLLaMA thread the same day announces z-lab’s Qwen3.6-27B DFlash drafter and claims DFlash is stateful (KV-cache positions and RoPE offsets persist across iterations) where MTP drafters are not. Worth holding loosely: a parallel community benchmark of llama.cpp speculative-decode modes on the same model class on RTX 3090 reports no net speedup, so the headline number is hardware/config-specific.
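To put the two reported throughput figures side by side, the arithmetic can be sketched as below. This is a minimal illustration only: the helper names `speedup` and `decode_seconds` are hypothetical, both inputs are the approximate numbers quoted above, and the decode-time estimate assumes a steady per-request output rate with prefill and batching ignored, which the source does not confirm.

```python
# Figures as reported in the digest entry; both are approximate ("~").
BASELINE_TOK_S = 228.0   # reported baseline throughput
DFLASH_TOK_S = 600.0     # reported throughput with num_speculative_tokens=8
OUTPUT_TOKENS = 1024     # output length of the random workload


def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """Ratio of two throughputs; values above 1.0 mean the first is faster."""
    if slow_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return fast_tok_s / slow_tok_s


def decode_seconds(tokens: int, tok_s: float) -> float:
    """Time to emit `tokens` at a steady rate.

    ASSUMPTION: ignores prefill time and batching effects.
    """
    if tok_s <= 0:
        raise ValueError("throughput must be positive")
    return tokens / tok_s


if __name__ == "__main__":
    print(f"speedup: {speedup(DFLASH_TOK_S, BASELINE_TOK_S):.2f}x")
    print(f"baseline decode: {decode_seconds(OUTPUT_TOKENS, BASELINE_TOK_S):.2f} s")
    print(f"DFlash decode:   {decode_seconds(OUTPUT_TOKENS, DFLASH_TOK_S):.2f} s")
```

On the reported numbers the ratio comes out at roughly 2.6x. The RTX 3090 llama.cpp result would correspond to a ratio near 1.0 under the same calculation.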
Key Developments
- Multi-Vendor DFlash Drafter Pattern: z-lab now ships DFlash drafters for both Gemma 4 26B-A4B and Qwen3.6-27B, establishing DFlash as a multi-vendor drafter pattern rather than a single-implementation novelty (the original Luce DFlash release was Qwen-only).
- Stateful vs. MTP Drafters: z-lab’s claim that DFlash is stateful (KV-cache positions and RoPE offsets persist across iterations) where MTP drafters are not is the substantive architectural distinction. If validated independently, it positions DFlash as a structurally different speculative-decoding path rather than a packaging variant.
- Hardware/Config-Specific Headline: ~600 tok/s on RTX 5090 with num_speculative_tokens=8 and the cyankiwi AWQ-4bit main pair is the load-bearing single-config win; community RTX 3090 llama.cpp benchmarks show no net speedup, so the result is best read as a hardware-and-quant-specific deployment milestone rather than a universal “DFlash beats MTP” claim.
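What "stateful" could mean in practice can be illustrated with a toy bookkeeping model. This sketch is not z-lab's implementation and does not model any specific MTP drafter: `StatefulDrafter`, `StatelessDrafter`, and `simulate` are hypothetical names, and the only thing counted is how many token positions a drafter would have to process if it either keeps its cache length and RoPE offset between iterations or rebuilds them from position 0 each time.

```python
from dataclasses import dataclass


@dataclass
class StatefulDrafter:
    """Toy drafter that keeps its cache length and RoPE offset between iterations."""
    kv_len: int = 0        # positions currently held in the (imaginary) KV cache
    rope_offset: int = 0   # position index the next new token would receive
    processed: int = 0     # total positions run through the drafter so far

    def step(self, context_len: int) -> None:
        new_tokens = context_len - self.kv_len   # only tokens not yet cached
        self.processed += new_tokens
        self.kv_len = context_len
        self.rope_offset = context_len           # next token continues from here


@dataclass
class StatelessDrafter:
    """Toy drafter that discards its state and re-reads the context every iteration."""
    processed: int = 0

    def step(self, context_len: int) -> None:
        self.processed += context_len            # cache rebuilt from position 0


def simulate(prompt_len: int, accepted_per_iter: list[int]) -> tuple[int, int]:
    """Return (stateful, stateless) counts of positions processed by each toy drafter."""
    stateful, stateless = StatefulDrafter(), StatelessDrafter()
    context_len = prompt_len
    for accepted in accepted_per_iter:
        stateful.step(context_len)
        stateless.step(context_len)
        context_len += accepted   # main model accepts some drafted tokens
    return stateful.processed, stateless.processed


if __name__ == "__main__":
    # ASSUMPTION: a 256-token prompt and 4 accepted tokens per iteration, chosen for illustration.
    print(simulate(256, [4, 4, 4]))
```

The gap between the two counts grows with every iteration in this toy model, which is why the stateful claim matters architecturally. Whether real DFlash and MTP drafters differ in this way is exactly what still needs independent validation.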
Related
See also: Luce DFlash, Gemma 4, Qwen3.6-27B, MOC - Open Source Models.