MODEL
Qwen-VLA
Overview
Qwen-VLA is a vision-language-action (VLA) foundation model designed to unify modeling across tasks, environments, and robot embodiments. It extends the Qwen stack with a DiT (diffusion transformer) action decoder, embodiment-aware prompting, and unified action-trajectory prediction.
Timeline
- 2026-05-30-AI-Digest — Paper published as arXiv:2605.30280 (▲82). Headline benchmarks: 97.9% on LIBERO, 86.1% / 87.2% on RoboTwin-Easy/Hard, 76.9% OOD success in real ALOHA experiments. Why it matters: the “one model, many embodiments” thesis gets a concrete, scoreable instantiation from a frontier open-weights lab.
Key Developments
- Embodiment-Aware Unified VLA from the Qwen Stack: A single VLA foundation model extending the Qwen stack with a DiT action decoder, embodiment-aware prompting, and unified action-trajectory prediction. Cross-embodiment generalisation gets a concrete benchmarked instantiation from an open-weights lab (2026-05-30-AI-Digest).
- Strong LIBERO / RoboTwin / ALOHA Results: 97.9% on LIBERO, 86.1% / 87.2% on RoboTwin-Easy/Hard, and 76.9% OOD success in real ALOHA experiments. Of these, the real-ALOHA OOD figure is the most practitioner-relevant data point for cross-embodiment generalisation claims.
Related
See also: Qwen, MOC - Open Source Models.