MODEL

Qwen-VLA

model, topic-note, open-source, vla, robotics, alibaba

Overview

Qwen-VLA is a vision-language-action (VLA) foundation model designed to unify modeling across tasks, environments, and robot embodiments. It extends the Qwen stack with a DiT (diffusion transformer) action decoder, embodiment-aware prompting, and unified action-trajectory prediction.
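This note does not record the paper's actual prompt format or action representation, so the following is only an illustrative sketch of what "embodiment-aware prompting" combined with a unified action trajectory could look like. Everything in it is an assumption made for illustration: the `Embodiment` class, the bracketed prompt tag, the shared width `MAX_ACTION_DIM`, the zero-padding scheme, and the degree-of-freedom counts. None of these names or choices are taken from Qwen-VLA itself.

```python
from dataclasses import dataclass

# Hypothetical shared action width; not a value from the paper.
MAX_ACTION_DIM = 16


@dataclass(frozen=True)
class Embodiment:
    """Illustrative robot description: a name and its native action size."""
    name: str
    action_dim: int


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with an (assumed) embodiment tag."""
    return f"[embodiment: {embodiment.name}; dof: {embodiment.action_dim}] {instruction}"


def to_unified_trajectory(actions, embodiment: Embodiment):
    """Zero-pad each action step to MAX_ACTION_DIM and return a validity mask.

    `actions` is a list of steps, each a list of `embodiment.action_dim` floats.
    The mask marks which of the padded dimensions carry real values.
    """
    if embodiment.action_dim > MAX_ACTION_DIM:
        raise ValueError("embodiment exceeds the shared action width")
    pad = [0.0] * (MAX_ACTION_DIM - embodiment.action_dim)
    padded = []
    for step in actions:
        if len(step) != embodiment.action_dim:
            raise ValueError("action step does not match embodiment.action_dim")
        padded.append(list(step) + pad)
    mask = [i < embodiment.action_dim for i in range(MAX_ACTION_DIM)]
    return padded, mask


# Example: two embodiments with different action sizes share one output shape.
ALOHA = Embodiment(name="aloha", action_dim=14)          # illustrative bimanual setup
SINGLE_ARM = Embodiment(name="single_arm", action_dim=7)  # illustrative 7-DoF arm

prompt = build_prompt(ALOHA, "pick up the cup")
aloha_traj, aloha_mask = to_unified_trajectory([[0.1] * 14] * 4, ALOHA)
arm_traj, arm_mask = to_unified_trajectory([[0.2] * 7] * 4, SINGLE_ARM)
```

The point of the sketch is that a single decoder can emit fixed-width trajectories for every robot, while the prompt tag and the mask tell it which dimensions are meaningful for the current embodiment. Whether Qwen-VLA pads, masks, or tokenises actions this way is not stated in the source.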

Timeline

  • 2026-05-30-AI-Digest: Paper published as arXiv:2605.30280 (▲82). Headline benchmarks: 97.9% on LIBERO, 86.1% on RoboTwin-Easy and 87.2% on RoboTwin-Hard, and 76.9% out-of-distribution (OOD) success in real ALOHA experiments. Why it matters: the “one model, many embodiments” thesis gets a concrete, scoreable instantiation from a frontier open-weights lab.

Key Developments

  1. Embodiment-Aware Unified VLA from the Qwen Stack: Qwen-VLA is a single VLA foundation model that extends the Qwen stack with a DiT action decoder, embodiment-aware prompting, and unified action-trajectory prediction. This gives cross-embodiment generalisation a concrete, benchmarked instantiation from an open-weights lab (2026-05-30-AI-Digest).

  2. Strong LIBERO / RoboTwin / ALOHA Results: 97.9% on LIBERO, 86.1% / 87.2% on RoboTwin-Easy/Hard, and 76.9% OOD success in real ALOHA experiments. The real-ALOHA OOD figure is the more practitioner-relevant data point for cross-embodiment generalisation claims, because it is measured on physical hardware under out-of-distribution conditions rather than in simulation.

See also: Qwen, MOC - Open Source Models.