MODEL

LLaVA-OneVision-2

model, topic-note, multimodal

Overview

LLaVA-OneVision-2 is the new flagship of the LLaVA-OV multimodal model line. It introduces codec-stream tokenization, which treats compressed video as a continuous bit-cost stream, and pairs it with windowed attention and a shared 3D RoPE that places images, frames, and long video under one positional scheme. Because codec-stream tokenization changes the input representation rather than the model built on top of it, it could, if it generalises, enable long-video work on far smaller context budgets.
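To make the "one positional scheme" idea concrete, here is a minimal sketch of shared 3D position indices. It assumes each visual token gets a (t, h, w) index and that a still image is handled as a one-frame video; the function name `position_ids_3d` and the raster ordering are illustrative assumptions, not details taken from the paper.

```python
from itertools import product


def position_ids_3d(num_frames: int, height: int, width: int) -> list[tuple[int, int, int]]:
    """Return one (t, h, w) index per visual token, frame by frame in raster order.

    Hypothetical helper for illustration only; the paper's actual 3D RoPE
    indexing may differ.
    """
    if min(num_frames, height, width) < 1:
        raise ValueError("num_frames, height, and width must all be >= 1")
    # product() varies the last axis fastest: w, then h, then t.
    return list(product(range(num_frames), range(height), range(width)))


# Assumption: a still image is a one-frame video, so it shares the same scheme.
image_ids = position_ids_3d(num_frames=1, height=2, width=2)
video_ids = position_ids_3d(num_frames=3, height=2, width=2)
```

Under this assumption the first frame of a video and a standalone image of the same grid size receive identical indices, which is one way a single rotary scheme could cover both inputs.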

Timeline

  • 2026-05-27-AI-Digest — Paper “LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence” surfaces on HuggingFace Papers (arXiv:2605.25979, ▲943 — the week’s highest HF Papers community signal). Codec-stream tokenization plus windowed attention and a shared 3D RoPE unifies images, frames, and long video under one positional scheme.

Key Developments

  1. Codec-Stream Tokenization (May 27, 2026): Treating compressed video as a continuous bit-cost stream is a change at the input-representation level that could, if it generalises, enable long-video work on far smaller context budgets. The paper's ▲943 on HuggingFace Papers was the week's highest community signal.
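This note does not record how the bit-cost stream maps onto tokens. As a purely illustrative reading, the sketch below assumes a fixed token budget is split across frames in proportion to each frame's compressed size, so frames that cost the codec more bits (typically keyframes or high-motion frames) receive more tokens. The function `allocate_tokens` and the proportional rule are hypothetical and are not the paper's method.

```python
def allocate_tokens(bit_costs: list[int], budget: int) -> list[int]:
    """Split `budget` tokens across frames in proportion to compressed size.

    Hypothetical illustration: uses largest-remainder rounding so the
    allocations always sum exactly to `budget`.
    """
    total = sum(bit_costs)
    if total <= 0 or budget < 0 or any(c < 0 for c in bit_costs):
        raise ValueError("bit costs must be non-negative with a positive sum; budget must be >= 0")
    # Integer arithmetic avoids floating-point rounding surprises.
    alloc = [budget * c // total for c in bit_costs]
    remainders = [budget * c % total for c in bit_costs]
    leftover = budget - sum(alloc)
    # Hand the leftover tokens to the frames with the largest remainders.
    by_remainder = sorted(range(len(bit_costs)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# One expensive keyframe followed by two cheap predicted frames (made-up sizes).
tokens_per_frame = allocate_tokens([8000, 1000, 1000], budget=10)
```

With these made-up sizes most of the budget goes to the first frame, which shows how a bit-cost-driven allocation could spend far fewer tokens on near-static footage than uniform per-frame sampling would.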

See also: MOC - Open Source Models.