MODEL

Gemma 4

model, topic-note, open-source

Google DeepMind’s most capable open-weight model family, released April 2, 2026 under the Apache 2.0 license. Successor to the earlier Gemma generations and distinct from Google’s proprietary Gemini line.

Key Specs

  • Four sizes: E2B, E4B, 26B MoE, and 31B Dense
  • 31B Dense ranks #3 on Arena AI text leaderboard among open models
  • All variants natively process video and images at variable resolutions
  • E2B and E4B add native audio input for speech recognition
  • Apache 2.0 license — fully permissive for commercial use

Timeline

  • 2026-04-04-AI-Digest — Google DeepMind launches Gemma 4 under Apache 2.0 with four sizes; 31B Dense ranks #3 among open models on Arena AI leaderboard. Community benchmarks on r/LocalLLaMA show competitive performance with Qwen 3.6 and Llama 4 at similar parameter counts. Successful local deployment via Ollama on consumer hardware confirmed.

  • 2026-04-05-AI-Digest — Apache 2.0 licensing confirmed as strategic shift from restricted-use; 400M+ cumulative downloads across generations; community reports successful fine-tuning with LoRA on consumer hardware (RTX 4090 for 26B variant); benchmarked against Llama 4 and Qwen 3.6.

  • 2026-04-06-AI-Digest — Android AICore Developer Preview announced; Apache 2.0 driving adoption past 400M downloads; #3 on Arena AI leaderboard.

  • 2026-04-07-AI-Digest — Gemma 4 referenced in community discussions of the open-model landscape, alongside DeepSeek V4.

  • 2026-04-07-AI-Digest — Gemma 4 mentioned in community context as researchers design peer preservation replication experiments.

  • 2026-04-09-AI-Digest — With Llama effectively retired from frontier open-weights competition after Meta’s Muse Spark closed-source pivot, r/LocalLLaMA threads now treat Gemma 4 31B and Qwen 3.5 as the new top of the open-weights stack. Gemma 4 31B wins on multimodal, long-context retention, multilingual, and structured output (JSON/markdown tables); Qwen 3.5 still wins on coding and tool calling with hybrid thinking mode. Both fit on a 24 GB RTX 4090 at 4-bit quantization.

  • 2026-04-12-AI-Digest — One week post-launch, r/LocalLLaMA sentiment consolidates around Gemma 4 31B Dense as the strongest all-around open model. Quantized 31B vs. Qwen 3.5 comparisons on a single RTX 4090: Gemma wins multimodal, structured output, long-context; Qwen retains the coding/tool-calling edge. Community practical consensus: run both with a router.

  • 2026-04-13-AI-Digest — Gemma 4’s Apache 2.0 licensing highlighted as the key differentiator changing the open-model calculus; 31B Dense outperforms Llama 4 on AIME 2026 Math (89.2% vs 88.3%), LiveCodeBench v6 (80.0% vs 77.1%), and GPQA Diamond (84.3% vs 82.3%), delivering a model that’s both technically competitive with 20x-larger models and legally frictionless for enterprise deployment.

  • 2026-04-14-AI-Digest — Gemma 4 remains a top r/LocalLLaMA community default alongside Qwen 3 Coder and Llama 4 in the April Hugging Face momentum tracker. Community consensus has stabilized into a Llama Stack + Gemma 4 + Qwen 3 Coder + DeepSeek V3 workflow across practical open-weights deployments.

  • 2026-04-15-AI-Digest — Gemma 4 continues to be referenced alongside Qwen 3.5, Llama 5, and imminent DeepSeek V4 in r/LocalLLaMA’s open-weights stack; Stanford AI Index places Chinese labs within 1.70% of top-US-model performance, putting Gemma 4’s Apache 2.0 positioning in a more competitive open-weights field.
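
As a rough check on the sizing claim in the timeline (a 31B model fitting on a 24 GB RTX 4090 at 4-bit quantization), the weight-memory arithmetic can be sketched as below. The function names and the 4 GB overhead figure are illustrative assumptions, not measurements. Real 4-bit formats store scales alongside the weights and so use somewhat more than 4 bits per weight, and the KV cache grows with context length.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM.

    overhead_gb is a hypothetical allowance for KV cache, activations,
    and runtime buffers; tune it for the actual context length.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts taken from the model names in this note.
    for name, params in [("Gemma 4 31B Dense", 31.0), ("Gemma 4 26B MoE", 26.0)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(params, bits)
            ok = fits_in_vram(params, bits, vram_gb=24.0)
            print(f"{name} @ {bits}-bit: ~{gb:.1f} GB weights, fits 24 GB: {ok}")
```

Under these assumptions the 31B weights take about 15.5 GB at 4 bits, which leaves headroom on a 24 GB card, while the 16-bit weights (about 62 GB) do not fit. For the MoE variant the total parameter count is used, since all experts must be resident even though only some are active per token.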

Context

Gemma 4 enters a six-way open-weight competition between Google, Alibaba (Qwen), Meta (Llama), Mistral, OpenAI (gpt-oss-120b), and Zhipu AI (GLM). See also: Qwen, Gemini, MOC - Open Source Models.

  • 2026-05-05-AI-Digest — A News flair thread on r/LocalLLaMA flagged a chat-template fix for Gemma 4 alongside fresh GGUF quantizations from bartowski and unsloth across the 2B–31B range. The signal is corpus-mood rather than capability — quick-turnaround quantizations remain the open-weights ecosystem’s main lever for moving new releases into practitioners’ hands within a day or two of the upstream cut.
  • 2026-05-06-AI-Digest — Google released Gemma 4 multi-token-prediction (MTP) draft models targeting ~3× speculative-decoding speedups, which depend on the drafter agreeing with the main model on common token patterns. The release lands a day after llama.cpp added beta MTP support (with Qwen3.5), narrowing the single-stream latency gap with vLLM on the open-weights side and opening a parity window for local inference.
  • 2026-05-07-AI-Digest — An r/LocalLLaMA thread shows the open-weights community extending Google’s speculative-decoding drafter pattern from Gemma 4 to non-Google frontier models, narrowing the gap with hosted vLLM serving for single-stream agentic loops at 262K context.
  • 2026-05-09-AI-Digest — z-lab’s gemma-4-26B-A4B-it-DFlash drafter benchmarked at ~600 tok/s on a single RTX 5090 against vLLM 0.19.2rc1 with num_speculative_tokens=8, up from a ~228 tok/s baseline (roughly 2.6×) on the cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit main + DFlash draft pair (256-input / 1024-output random workload). Pair with the Luce DFlash timeline in 2026-04-28-AI-Digest: DFlash is now a multi-vendor drafter pattern across Qwen and Gemma rather than a single-implementation novelty. Worth holding loosely: a separate community benchmark of llama.cpp speculative-decode modes on RTX 3090 reports no net speedup, so the headline number is hardware/config-specific.
  • 2026-05-22-AI-Digest — A HN write-up (~346 pts, ~102 cmts) brute-forces Gemma4-31B multimodal inference on a 2021 MacBook by leaning on 50GB of swap to index a year of video locally. Feasibility data point, not a usability one — the 50GB swap caveat is the real story. Reads as: mid-30B multimodal models are now runnable on consumer hardware as overnight batch jobs, not as interactive workflows. Useful evidence in the “local-first multimodal works if you treat it as a batch job” file; misleading if framed as “Gemma 4 is now usable on older hardware.”
  • 2026-06-04-AI-Digest — Gemma 4 12B ships from Google DeepMind: 11.95B params, Apache-2.0, natively multimodal, encoder-free (text + image + audio in one stack), the first mid-sized Gemma with native audio. Google claims it “nearly matches” Gemma 3 27B on GPQA Diamond, MMLU Pro, and DocVQA while running on a single 16 GB-RAM laptop; available on HF, Ollama, and LM Studio at release. Most-discussed AI launch on HN today. Practitioner read: if you’ve been running Gemma 3 27B on a 24 GB workstation, Gemma 4 12B is a same-class drop-in that frees the headroom; the native audio path is the new capability over v3. Honest framing against today’s Aider polyglot top-5 (all closed reasoning models: GPT-5, o3-pro, Gemini 2.5 Pro): “open-weights compressing the size-to-quality curve internally,” not “open catching up to the closed frontier.”
  • 2026-06-06-AI-Digest — Gemma 4 QAT (quantization-aware-trained) checkpoints released by Google, aimed at mobile and laptop hardware; the E2B variant fits in ~1 GB. Landed on HN at 310 pts / 91 cmts. A concrete deployment-efficiency release following the Gemma 4 12B launch (2026-06-04-AI-Digest): the on-device substrate keeps shifting down a tier even while the Aider polyglot top-5 (fetched today) remains wall-to-wall closed reasoning. The practitioner read is “1 GB multimodal on a phone,” not “open caught the frontier.”
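
The speculative-decoding entries above rely on a draft model proposing several tokens that the main model then verifies. A minimal toy sketch of the greedy-match variant follows, assuming deterministic stand-in "models" written as plain functions. All names here (`speculative_step`, `toy_target`, `toy_draft`) are hypothetical, and this is not the Gemma 4 MTP or DFlash implementation. The parameter `k` plays the role that `num_speculative_tokens` appears to play in the vLLM benchmark.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one
    token from the target. Returns (new_tokens, accepted_draft_count)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target checks each position. A real engine scores all k
    #    positions in one batched forward pass; this loop is for clarity.
    new_tokens: List[int] = []
    ctx = list(context)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            # First disagreement: emit the target's token and stop.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    new_tokens.append(target(ctx))
    return new_tokens, k


def generate_speculative(target: NextToken, draft: NextToken,
                         prompt: List[int], max_new: int, k: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return the number of verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        new_tokens, _ = speculative_step(target, draft, out, k)
        out.extend(new_tokens)
        steps += 1
    return out[:len(prompt) + max_new], steps


def generate_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # counts 0..9 in a loop


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = generate_speculative(toy_target, toy_draft, [0], max_new=12, k=3)
    print(tokens, "verify steps:", steps)
```

The output always matches plain greedy decoding from the target. The drafter only changes how many target passes are needed, so the speedup depends on how often the draft agrees, which is consistent with the RTX 3090 report of no net gain in some configurations. Sampled (non-greedy) decoding needs a rejection-sampling acceptance rule instead of the exact-match check used here.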