Anthropic ships the first public progress report on Project Glasswing, its interpretability and alignment research initiative — the post hits the HN front page with 371 points and sustained technical discussion.

AI Digest — May 23, 2026

Your daily deep-dive on AI models, tools, research, and developer ecosystem news.

🔖 Project Releases

Claude Code

v2.1.149 → v2.1.150: two more releases since yesterday’s digest (which cut at v2.1.148). v2.1.149 (2026-05-22) is the substantive one: /usage now shows a per-category cost breakdown (skills, subagents, plugins, MCP servers), giving operators a long-awaited view of where token spend is actually going; /diff gets full keyboard scrolling (arrows, j/k, PgUp/PgDn, Space, Home/End); GFM task-list checkboxes finally render in markdown; and an enterprise allowAllClaudeAiMcps managed setting lands. Bug fixes worth flagging: a PowerShell cd-function permission bypass, the sandbox write allowlist in git worktrees, and a find-call pattern that was exhausting the macOS vnode table on large repos. v2.1.150 (2026-05-23) is infrastructure-only: no user-facing changes, a point release stacked one day after the v2.1.149 feature drop. The accelerated cadence is real but should not be overread: the trailing 11 days run at ~0.9 releases/day, so this week’s four-in-three-days burst (~1.3/day) is concentrated, not the new steady state. cf. 2026-05-22-AI-Digest.
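The cadence comparison above is plain arithmetic, and a minimal Python sketch makes it explicit. The helper name releases_per_day is hypothetical, and the only inputs are the figures quoted in this item (four releases in three days, ~0.9/day over the trailing 11 days). The implied trailing count is a back-of-envelope estimate, not a number read from the changelog.

```python
def releases_per_day(release_count: float, window_days: float) -> float:
    """Average release rate over a window (hypothetical helper)."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return release_count / window_days


# Figures quoted in the digest item above.
BURST_RELEASES = 4          # this week's burst
BURST_DAYS = 3
TRAILING_RATE = 0.9         # approximate releases/day, trailing window
TRAILING_DAYS = 11

burst_rate = releases_per_day(BURST_RELEASES, BURST_DAYS)

# Assumption: the ~0.9/day figure is a simple count divided by 11 days,
# so the implied number of releases in that window is rate * days.
implied_trailing_releases = TRAILING_RATE * TRAILING_DAYS

# How much faster the burst is than the trailing average.
burst_vs_trailing = burst_rate / TRAILING_RATE

print(f"burst rate: {burst_rate:.2f}/day")
print(f"implied trailing releases: {implied_trailing_releases:.1f}")
print(f"burst vs trailing: {burst_vs_trailing:.2f}x")
```

On these inputs the burst runs at roughly one and a half times the trailing average, which supports reading it as a concentrated week rather than a new baseline.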

Beads

No new release this week — v1.0.4 from earlier this month remains current. Main-branch activity continues but the tag cadence has now been quiet for ~14 days, second consecutive digest with nothing new to report.

OpenSpec

No new release this week — v1.3.1 “Path & Telemetry Fixes” from late April remains current. ~32 days since last release; the slowdown is now a sustained pattern rather than a single quiet stretch, worth tracking for anyone building on top.

🧵 From the Community

Aider polyglot top-5 (fetched 2026-05-23): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%

The board is identical to yesterday’s snapshot: OpenAI still holds four of five slots (GPT-5 takes three of them, with o3-pro in third), and Gemini 2.5 Pro’s preview is the only non-OpenAI entry. Notably absent again: Gemini 3.5 Flash (out for ~four weeks now) and any newer Gemini variant. Aider still hasn’t independently validated Google’s I/O claims.

Papers

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards (arXiv:2605.21467, ▲125) — Recasts the RLVR policy-gradient update as a linear discriminator over token-gradient vectors, then shows standard sequence-level RLVR gets dominated by high-frequency formatting tokens; DelTA reweights tokens to amplify side-specific gradient directions, beating same-scale baselines by 3.26 / 2.62 average points on Qwen3-8B/14B-Base across seven math benchmarks. Why it matters: a principled credit-assignment fix for the now-dominant RLVR reasoning recipe, with clean gains on the same backbones everyone’s already using.
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps (arXiv:2605.16928, ▲81) — Argues full-attention LLMs already contain intrinsic sparsity, then converts them to sparse-inference models in only hundreds of training steps via retrieval-head-preserving KV cache plus lightweight token indexing, achieving a 9.36× prefill speedup at 1M context and ~2.01× decode speedup near-losslessly. Why it matters: undercuts the case for expensive native sparse pretraining — you can retrofit cheap long-context inference onto existing dense models.
ACC: Compiling Agent Trajectories for Long-Context Training (arXiv:2605.21850, ▲53) — Turns the multi-turn tool-call trajectories agents already generate into long-context QA pairs, letting a fine-tuned Qwen3-30B-A3B reach 68.3 / 77.5 on MRCR / GraphWalks — comparable to Qwen3-235B-A22B — without curated long documents. Why it matters: agent runs are becoming a free, scalable substrate for long-context training data.

Hacker News

Project Glasswing: An Initial Update (371 pts · 228 cmts) — Anthropic published the first public progress report on Project Glasswing, the company’s interpretability/alignment research initiative. Sustained 228-comment discussion on HN — see Technical News below for the read; the HN signal alone is the reasonable proxy for “this matters.”
Microsoft reports AI is more expensive than paying human employees (107 pts · 28 cmts): Fortune reads Microsoft’s cost disclosures and internal reports (alongside the Uber CTO’s budget-burn commentary) as evidence that running production AI agents now costs more than the human labor they replace. The “Microsoft acknowledges” framing in the headline is editorial, with no on-record Satya/Suleyman quote behind it, but the underlying margin signal is real enough to track.
Antigravity 2.0 Tops the OpenSCAD Architectural 3D LLM Benchmark (369 pts · 146 cmts) — A community-built benchmark from Modelrift evaluating LLMs on OpenSCAD architectural-3D code generation puts Antigravity 2.0 at #1. Niche eval, but the leader isn’t a typical general-coding favorite — useful as a model-specific-strengths data point to hold for cross-reference when other structured-spatial-output evals appear.

📰 Technical News & Releases

Anthropic ships the first Project Glasswing progress report — and HN actually reads it

Source: Anthropic

Anthropic published an initial update on Project Glasswing, its interpretability and alignment research initiative, drawing 371 points and 228 comments on Hacker News — heavy discussion for a research-blog format and one of the more substantive practitioner reads of the past month. The HN signal is itself the noteworthy part: research-direction milestones from frontier labs rarely sustain that kind of comment volume unless the technical content actually lands, and readers in the thread were trading concrete points on interpretability methodology rather than performing the usual “alignment vs. capabilities” rhetorical theater. For ML practitioners, this is the rare frontier-lab post worth opening at the source rather than waiting for the news cycle’s flattening pass.

The HN signal is the read, not the marketing

Anthropic’s research-blog posts vary widely in quality: most are capability announcements with a thin interpretability gloss; a few are substantive direction-setting reads. A 228-comment front-page HN thread with sustained technical discussion is a cheap proxy for which kind this is. The URL is the deliverable; the digest’s job is to flag it, not to paraphrase it.

Salesforce’s Agentforce: still mostly demo, still selling growth

Source: Bloomberg

Bloomberg scrutinizes Salesforce’s Agentforce marketing claims and finds much of the showcased AI functionality is still aspirational, with little of it live in production at customer scale. The headline ARR figure (~$800M, +169% YoY) is the Q4 FY26 print from February 2026 earnings, not a fresh disclosure; Bloomberg’s contribution is the deployment-reality check rather than new numbers. The honest cross-vendor read: this isn’t unique to Agentforce. Microsoft’s Copilot Studio governance pivot in April 2026 was driven by the same gap between agent demos and production behavior (agents that lose relevance, loop, or stall in real deployments), so treat the Salesforce-specific framing as the visible edge of a broader pattern, not a one-company failure.
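For readers who want to sanity-check the headline growth figure, a minimal Python sketch backs out the prior-year base that ~$800M at +169% YoY implies. The function name implied_prior_value is hypothetical, and both inputs are the rounded figures quoted above, so the result is an approximation rather than a reported number.

```python
def implied_prior_value(current: float, yoy_growth_pct: float) -> float:
    """Back out last year's value from this year's value and YoY growth."""
    if yoy_growth_pct <= -100.0:
        raise ValueError("growth must be greater than -100%")
    return current / (1.0 + yoy_growth_pct / 100.0)


# Rounded figures as quoted in the digest (USD millions, percent).
current_arr_musd = 800.0
yoy_growth_pct = 169.0

prior_arr_musd = implied_prior_value(current_arr_musd, yoy_growth_pct)
added_arr_musd = current_arr_musd - prior_arr_musd

print(f"implied prior-year ARR: ~${prior_arr_musd:.0f}M")
print(f"implied ARR added over the year: ~${added_arr_musd:.0f}M")
```

The point of the exercise is scale: a triple-digit growth rate on a base of a few hundred million is a different claim from the same rate on a multi-billion base, which is worth keeping in mind when weighing it against the deployment-reality reporting.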

Market concentration deepens — and the active-manager framing needs softening

Source: Bloomberg

The top 10 names now make up roughly 40% of the S&P 500 as AI-driven concentration deepens, with hyperscaler capex and circular AI-vendor financing pulling index returns toward a narrow set of compute and model providers. The framing implication — that AI concentration is causing active-manager underperformance — is the easy misread; SPIVA’s persistence scorecard shows ~76-79% of active large-cap managers underperformed in 2013-15 when concentration was lowest, so the honest read is that AI concentration is amplifying a pre-existing structural headwind, not creating one. For anyone modelling end-market demand or competitive capital costs, the sharper data point in the piece is the 28-session rally where 10 names drove ~69% of gains — a useful concentration anchor independent of the active-manager narrative.

Hark closes a $700M Series A at $6B — and the cap table is the story

Source: TechCrunch | BusinessWire

Hark, the new AI lab from Brett Adcock (Figure, Archer) building a personal-assistant model-plus-hardware stack, closed a $700M Series A at a $6B post-money valuation, led by Parkway Venture Capital with participation from NVIDIA Ventures, AMD Ventures, ARK, Salesforce Ventures, Qualcomm, Intel Capital, Brookfield, and Greycroft. The size at Series A is the obvious headline; the more interesting read is the strategic-investor stack: having both NVIDIA and AMD on the cap table is unusual, and the Salesforce Ventures + Qualcomm + Intel combination signals positioning for both supply-side (compute) and distribution-side (enterprise + edge) optionality before a model has even shipped. Worth tracking against OpenAI’s hardware ambitions and Google’s Gemini-on-glass effort, but the cap-table shape is the differentiator at this stage.

Datasette Agent ships — Simon Willison’s three-year LLM project converges with Datasette

Source: Simon Willison’s Weblog

Simon Willison released Datasette Agent, a conversational AI assistant that converts natural-language questions into SQLite queries over Datasette databases, with an extensible plugin architecture for charts (Observable Plot), image generation (ChatGPT Images 2.0), and sandbox code execution (Fly Sprites). The live demo runs on Gemini 3.1 Flash-Lite for cost and speed; the plugin design also supports open-weight models like Gemma 4, so the project isn’t locked to any single provider. The framing Simon uses — “the moment LLM and Datasette finally come together” — is worth taking at face value; the NL→SQL agent space has been crowded with marketing demos, and a working plugin-architecture release from a practitioner with three years of LLM-tooling investment is the kind of primary-source build worth reading the post on rather than waiting for the news outlets to find it.
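To make the NL→SQL pattern concrete, here is a minimal Python sketch of the general shape: a question is translated to SQL, and the SQL runs against SQLite with writes disabled. This is not Datasette Agent’s code or API. The translate_question function is a hypothetical stand-in for the model call (it returns canned SQL), the papers table is illustrative data taken from this digest, and the read-only guard via PRAGMA query_only is an assumption about one reasonable way to keep model-written SQL from modifying data.

```python
import sqlite3


def translate_question(question: str) -> str:
    """Hypothetical stand-in for the LLM step: returns canned SQL."""
    canned = {
        "which paper has the most upvotes?":
            "SELECT title FROM papers ORDER BY upvotes DESC LIMIT 1",
    }
    return canned[question.strip().lower()]


def run_read_only(conn: sqlite3.Connection, sql: str) -> list:
    """Execute SQL with SQLite's query_only pragma on, so writes fail."""
    conn.execute("PRAGMA query_only = ON")
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Restore normal behavior for the rest of the program.
        conn.execute("PRAGMA query_only = OFF")


# Illustrative in-memory database (titles and upvotes from the Papers list).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, upvotes INTEGER)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("DelTA", 125), ("Full Attention Strikes Back", 81), ("ACC", 53)],
)
conn.commit()

sql = translate_question("Which paper has the most upvotes?")
rows = run_read_only(conn, sql)
print(rows)
```

Keeping the execution path read-only matters more than the translation step in a sketch like this: whatever model produces the SQL, a generated DELETE or UPDATE is rejected by SQLite itself instead of relying on the prompt to prevent it.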

🧭 Key Takeaways

Project Glasswing is the day’s actually-read frontier-lab post. When an Anthropic research-blog item sustains 228 comments on the HN frontpage with technical-substance discussion rather than rhetoric, that’s the rare alignment-research milestone worth opening at the source. The signal isn’t the post existing — those happen weekly; it’s the practitioner discussion landing on the content rather than the framing.
The agent-demo-vs-production gap is now a cross-vendor pattern, not a Salesforce story. Bloomberg’s Agentforce piece reads as a single-company critique; the more useful framing is that Microsoft’s Copilot Studio April governance pivot and Salesforce’s growth-vs-deployment gap are the same underlying signal. Treat any agent-deployment number from any vendor with the same scrutiny going forward.
The Aider polyglot board is now in its fourth consecutive week with no Gemini 3.5 Flash entry. Google shipped Flash four weeks ago; Aider — the most-cited practitioner code benchmark — still hasn’t validated the I/O claims independently. That gap is itself a data point, and the longer it persists, the more it shifts the burden of proof.
Claude Code is in an accelerated-but-not-sustainable release week. Four releases in three days (v2.1.147 → v2.1.150), including an infra-only bump one day after the feature release, is a burst pattern, not a new normal: the trailing 11-day rate is ~0.9/day. The substantive v2.1.149 features (per-category /usage cost breakdown, GFM checkboxes, sandbox + PowerShell hardening) are worth a claude update; the cadence narrative isn’t.
The fresh-research diet today is RLVR + sparse-attention transfer + agent-trajectory long-context training. All three top HF papers (DelTA, Full Attention Strikes Back, ACC) hit the same theme: practitioners squeezing more out of existing dense base models with cheap post-training adaptations rather than waiting for the next pretraining generation. The dense-base + cheap-adapter stack keeps proving more interesting than the bigger-base narrative the news cycle prefers.

Generated on 2026-05-23 by Claude