MODEL

MiniMax M3

model · topic-note · open-source · china · minimax

Overview

MiniMax M3 is MiniMax’s third-generation flagship model, announced 2026-06-02 with a new MiniMax Sparse Attention architecture targeting long-context efficiency. The model is open-weight, with weights set to drop to Hugging Face and GitHub within 10 days of the announcement. MiniMax claims ~1/20th the compute at 1M tokens, plus 9× faster input and 15× faster generation vs dense attention at long context, and says the model was trained on 100T interleaved multimodal tokens. Vendor-published benchmarks (SWE-Bench Pro 59%, BrowseComp 83.5) are unaudited but credible enough that even partial validation would make M3 the load-bearing open-weights story of its launch week.

Timeline

  • 2026-06-02-AI-Digest: MiniMax announces M3 with MiniMax Sparse Attention, claiming ~1/20th the compute at 1M tokens, plus 9× faster input and 15× faster generation vs dense attention at long context. The model was trained on 100T interleaved multimodal tokens, with weights set to drop to Hugging Face and GitHub within 10 days. Vendor benchmarks: SWE-Bench Pro 59% (ahead of GPT-5.5 and Gemini 3.1 Pro, just behind Opus 4.7) and BrowseComp 83.5 (beats Opus 4.7’s 79.3). All numbers are vendor-published and unaudited, but if even half of the sparse-attention efficiency holds at scale, this is the actual open-weights story of the week. It lands days after the SimSD speculative-decoding-for-diffusion-LMs paper, which had already put cheaper long-context serving on the agenda.

Key Developments

  1. MiniMax Sparse Attention, First Credible Sparse-Attention Numbers at Long Context (June 2, 2026): The architecture’s headline claims (1/20th the compute at 1M tokens, 9× faster input, 15× faster generation vs dense at long context) are the first vendor-published sparse-attention efficiency numbers at long context from the open-weight cohort. They pair with the same week’s SimSD speculative-decoding-for-diffusion-LMs result: long-context serving is getting cheaper from two independent architectural directions.

  2. Open-Weights Distribution on a 10-Day Clock: The “weights to Hugging Face + GitHub within 10 days” commitment is the credibility test for the sparse-attention claims. Once weights drop, independent reproduction of the 1M-token throughput numbers becomes feasible. Until then, treat the benchmarks as vendor-published and unaudited.

  3. Benchmark Position: SWE-Bench Pro 59% places M3 ahead of GPT-5.5 and Gemini 3.1 Pro, just behind Claude Opus 4.7; BrowseComp 83.5 beats Opus 4.7’s 79.3. An open-weight model that comes this close to a frontier closed model is the load-bearing capability claim, and it is distinct from the efficiency claim above.
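
Once weights are available, the reproduction check in item 2 could be scripted along these lines. This is a minimal sketch, not MiniMax’s methodology: the `Timing` container and both function names are hypothetical, and the only figures taken from the announcement are the claimed 9× (input) and 15× (generation) speedups. It assumes you have timed the same long-context workload on a dense-attention baseline and on M3.

```python
from dataclasses import dataclass

# Vendor-claimed speedups vs dense attention at long context (from the announcement).
CLAIMED_SPEEDUP = {"input": 9.0, "generation": 15.0}


@dataclass
class Timing:
    """Wall-clock seconds for one workload on both models (hypothetical container)."""
    dense_seconds: float   # dense-attention baseline
    sparse_seconds: float  # M3 with MiniMax Sparse Attention


def measured_speedup(t: Timing) -> float:
    """Speedup of the sparse model over the dense baseline."""
    if t.dense_seconds <= 0 or t.sparse_seconds <= 0:
        raise ValueError("timings must be positive")
    return t.dense_seconds / t.sparse_seconds


def fraction_of_claim(phase: str, t: Timing) -> float:
    """Measured speedup as a fraction of the vendor claim (1.0 = fully reproduced)."""
    return measured_speedup(t) / CLAIMED_SPEEDUP[phase]
```

A result of 0.5 from `fraction_of_claim` corresponds to the “even half of the efficiency holds” scenario in the timeline entry; what counts as a fair dense baseline is left to whoever runs the measurement.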

See also: MiniMax, MiniMax-M2, Claude Opus 4.7, MOC - Open Source Models, MOC - AI Infrastructure.