AI Digest — June 7, 2026

[[Anthropic]] puts a hard number on dogfooded coding agents — **>80% of code merged into its own repo in May was Claude-authored** (low single digits pre-Feb 2025), with engineers shipping ~8× more code/day — landing the same week Sen. Jim Banks (R-IN) flags recursive self-improvement on the record as a national-security threshold and [[Sakana AI]] stands up a dedicated RSI Lab in Tokyo. [[Google]] separately commits ~$29B over 32 months to lease ~110K NVIDIA GPUs from [[SpaceX]] sited at [[Colossus 1|xAI's Colossus]] data centers, and [[OpenAI]] ships ChatGPT memory "Dreaming V3" with recall climbing 41.5%→67.9%→82.8% and a ~5× compute cut that unlocks memory for Free users.

Your daily deep-dive on AI models, tools, research, and developer ecosystem news.


🔖 Project Releases

Claude Code

Two more fixes-only point releases capping yesterday’s substantive v2.1.166. Claude Code v2.1.167 (2026-06-06 01:33 UTC) and v2.1.168 (2026-06-06 23:41 UTC) both ship as bare “bug fixes and reliability improvements” tags with no public changelog beyond the headline. The substantive features — the new fallbackModel setting (up to three fallbacks tried in order when the primary is overloaded; --fallback-model now also applies to interactive sessions), glob patterns in deny rules ("*" denies all tools), hardened cross-session messaging (SendMessage relays lose user authority — receivers refuse relayed permission requests; auto-mode blocks them outright), MAX_THINKING_TOKENS=0 actually disabling thinking on Anthropic-API models that think by default, and the pre-download version announcement on claude update — all landed in v2.1.166 and are covered in 2026-06-06-AI-Digest. The cadence (three tags in 48 hours, two of them fixes-only) is the read: Anthropic is on a “ship the substantive change, then bake out the regressions on the same day” pattern rather than gating point releases.
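
The "fallbacks tried in order when the primary is overloaded" behaviour can be sketched generically. This is a minimal illustration of the pattern, not Claude Code's implementation or its settings schema: `OverloadedError`, `call_with_fallbacks`, `fake_send`, and the model names are all hypothetical, and the cap of three simply mirrors the limit mentioned above.

```python
from typing import Callable, Optional, Sequence, Tuple


class OverloadedError(Exception):
    """Hypothetical error a client raises when a model is overloaded."""


def call_with_fallbacks(
    send: Callable[[str, str], str],
    prompt: str,
    primary: str,
    fallbacks: Sequence[str] = (),
    max_fallbacks: int = 3,
) -> Tuple[str, str]:
    """Try the primary model, then each fallback in the order given.

    Returns (model_used, reply). Only overload errors trigger a fallback;
    any other exception propagates immediately.
    """
    candidates = [primary, *list(fallbacks)[:max_fallbacks]]
    last_error: Optional[OverloadedError] = None
    for model in candidates:
        try:
            return model, send(model, prompt)
        except OverloadedError as err:
            last_error = err  # remember the failure and try the next candidate
    raise RuntimeError("all candidate models were overloaded") from last_error


# Stand-in transport (an assumption for the demo): only "model-c" has capacity.
def fake_send(model: str, prompt: str) -> str:
    if model != "model-c":
        raise OverloadedError(model)
    return f"{model}: ok"


used, reply = call_with_fallbacks(fake_send, "hi", "model-a", ["model-b", "model-c"])
print(used, reply)
```

Restricting the fallback trigger to overload errors is a deliberate choice in this sketch: an auth or validation failure would fail the same way on every model, so retrying elsewhere would only hide it.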

Beads

No new release. Beads v1.0.5 (2026-05-29, pre-release) remains the stuck tag flagged across the last six digests — nine days out, with the 🚨 do not upgrade gate still in place because migration 0043 can silently and unrecoverably break multi-machine bd dolt sync once both clones upgrade (issue #4259). Homebrew remains reverted to v1.0.4; the announced fix-forward v1.0.6 has still not shipped. Status unchanged from 2026-06-06-AI-Digest — the next tag is still the only signal worth watching.

OpenSpec

No new release. OpenSpec v1.4.1 (2026-06-03) — the single-issue “Update Fix” patch that restored openspec update for projects carrying their own workspace.yaml (e.g. Dagster) — remains the latest. Quiet four days; nothing new to report against 2026-06-04-AI-Digest’s coverage.


🧵 From the Community

Aider polyglot top-5 (fetched 2026-06-07): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Identical to the 2026-06-06-AI-Digest snapshot — the publicly published board has now been frozen at the 2025-11-20 refresh for over six months and is functioning as a reference floor, not a leading indicator. The honest caveat: several Jun 2026 open-weight releases (DeepSeek V4-Pro, Qwen3-Coder-Next, MiniMax M3) have not been benchmarked here yet, and at least one community-maintained Aider mirror puts DeepSeek-V3.2-Exp into the top-5 at ~74%. Read the public top-5 as a slow-moving anchor for closed-reasoning quality, not as today’s open-vs-closed scorecard.

Papers

  • The Self-Correction Illusion: LLMs Correct Others but Not Themselves (arXiv:2606.05976, Chen, Su, Chiang) — Across 13 model-domain combos in seven model families, relabeling an identical erroneous claim from “internal thought” to “external source” lifts explicit-correction rate by 23–93 pp (10/13 combos significant at p<0.001). Why it matters: argues the long-standing “LLMs can’t self-correct” finding is a chat-template artifact, not a cognitive limit — fixable by restructuring the prompt that wraps the model’s own output before re-evaluation, no retraining needed. Directly relevant for anyone wiring reflection or critic loops.
  • Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution (arXiv:2606.06492, ▲63) — A hypernetwork generates repo-specific LoRA adapters with zero inference-time token overhead; an “Evo” variant maintains a GRU-backed adapter updated per code diff, matching per-repo LoRA on static code (66.2% in-repo EM) and beating shared LoRA by +5.2 pp on evolving codebases. Why it matters: a third path between RAG and per-repo fine-tuning for the repository-context problem that dominates real coding-agent costs.
  • Latent Reasoning with Normalizing Flows (NF-CoT) (arXiv:2606.06447, Tu, Fu, Yu, Tang, Kang, Qin, Y. Zhang, Jiatao Gu) — Performs intermediate reasoning in compact continuous latent states via normalizing flows while preserving left-to-right generation and KV-cache compatibility; reports code-generation gains at lower intermediate-step cost than explicit CoT. Why it matters: a credible, infra-compatible “thinking without tokens” approach — note Jiatao Gu as senior author.
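
For anyone wiring a critic loop, the relabeling idea from the Self-Correction Illusion paper can be sketched as a pair of prompt wrappers. The template wording and the function names here are assumptions for illustration, not the paper's actual prompts; the only point is that the claim text stays identical while its attribution changes.

```python
def wrap_as_internal(claim: str) -> str:
    """Present the claim as the model's own earlier reasoning."""
    return (
        "Here is your earlier reasoning:\n"
        f"<thought>{claim}</thought>\n"
        "Review it and correct any errors."
    )


def wrap_as_external(claim: str) -> str:
    """Present the identical claim as coming from an outside source."""
    return (
        "An external source states the following:\n"
        f"<source>{claim}</source>\n"
        "Review it and correct any errors."
    )


def critic_prompt(draft: str, relabel_as_external: bool = True) -> str:
    """Build the re-evaluation prompt for a draft the model produced itself."""
    wrap = wrap_as_external if relabel_as_external else wrap_as_internal
    return wrap(draft)


draft = "The capital of Australia is Sydney."
prompt = critic_prompt(draft)
print(prompt)
```

Keeping both wrappers behind one flag makes it cheap to A/B the two framings on your own eval set before committing to either.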

Hacker News

  • Harness engineering: Leveraging Codex in an agent-first world (129 pts · 79 cmts) — OpenAI engineering post arguing that scaffolding around the model (the “harness”) is now the dominant lever for agent quality, framing harness design as a first-class engineering discipline. Why it matters: pair this with Anthropic’s recursive-self-improvement post from earlier this week (see Technical News below) — both frontier labs are now publishing the same diagnosis that the model alone has stopped being the bottleneck, and the harness around it is.
  • Google to pay SpaceX $920M a month for compute capacity at xAI data centers (191 pts · 772 cmts) — CNBC report that Google has committed to a 32-month, ~$29B GPU-leasing arrangement with SpaceX, with the underlying capacity (~110,000 NVIDIA GPUs) sited at xAI’s Colossus data centers. Why it matters: the third hyperscaler-grade lab in three weeks publicly leasing serving capacity from a Musk vehicle — a continuation of the cross-stack-leasing pattern, not a new posture (see Technical News for the structure and counterparty story).

📰 Technical News & Releases

Anthropic Puts a Number on Its Own Coding Loop — >80% of Code Merged in May Was Claude-Authored

Source: Anthropic Institute | VentureBeat | Fortune

Anthropic’s “When AI builds itself” post (Marina Favaro and Jack Clark, published this week as part of the Anthropic Institute’s recursive-self-improvement series) puts the first hard internal number on dogfooded coding agents: >80% of code merged into Anthropic’s own repository in May 2026 was Claude-authored, against a low-single-digits baseline before the Claude Code preview shipped in February 2025; engineers are reportedly merging roughly 8× more code per day versus 2024. The wording on Anthropic’s page collapses “merged” and “authored” — they own that flattening at source — and the comparison is to “production code,” which still has the human merge-review loop wrapped around it.

The post frames these numbers inside RSI as a forward-looking safety category, paired with last week’s call for a coordinated frontier-lab pause (2026-06-05-AI-Digest). The 80% figure is the substantive new datum; the pause-call posture was already in the corpus.

What the 80% does and doesn’t say

This is Anthropic’s own repo with Anthropic’s own engineers using Anthropic’s own tools — a ceiling under maximally favourable dogfooding conditions (modern Python/TS stack, no large legacy code, AI-native team), not a benchmark for what Claude Code will land in an arbitrary enterprise codebase. It is, however, the strongest first-party data point yet on how far a frontier lab’s own dev loop has been re-shaped by its own coding agents — and the number worth carrying forward when sizing realistic adoption targets is “what fraction of your merge volume can the agent draft under review,” not “will 80% generalize.”

Same week, Sen. Jim Banks (R-IN) (Bloomberg, Jun 5) backed Trump’s recent AI cybersecurity executive order and explicitly flagged AI systems that “do AI R&D” as a national-security threshold the US must hit before the PRC. Read alongside Anthropic’s post and the Sakana AI RSI Lab announcement below, RSI as a vocabulary now spans a frontier lab’s own engineering retrospective, a sitting US senator’s oversight pitch, and an independent commercial lab’s strategic positioning — three vectors in a single week, which is the load-bearing signal rather than any one of them alone.

Sakana AI Stands Up a Dedicated Recursive-Self-Improvement Lab

Source: Sakana AI | The Decoder

Tokyo-based Sakana AI — founded by Transformers co-author Llion Jones and ex-Google Brain / ex-Stability David Ha — announced a new dedicated Sakana AI RSI Lab organised around recursive-self-improvement as the company’s path to competing with frontier labs without matching their capex envelope. The lab’s positioning explicitly cites Sakana’s earlier work (LLM², Darwin Gödel Machine) and its “AI Scientist” paper, which Nature published in March 2026. No fresh funding was disclosed alongside the lab announcement — this is research-strategy positioning, not a capital event.

The thesis Sakana is pitching — that an RSI-shaped research bet can substitute for hyperscaler-scale training budgets at a non-frontier lab — is the contrarian counterpoint to this week’s frontier-lab capex stories. Whether it works is an empirical question the lab now has to answer; what it does to the conversation today is push the second non-Anthropic data point in a week (alongside Recursive Superintelligence’s May emergence, 2026-05-16-AI-Digest) into the corpus’s RSI thread, sharpening the read that RSI is no longer just an Anthropic-and-frontier-safety topic.

Sriram Krishnan Departs White House AI Advisor Role — Plans Independent Tech-Policy Institution, Not a16z Return

Source: TechCrunch | Washington Post | CNBC

Sriram Krishnan — the Andreessen Horowitz partner who became a senior White House AI policy advisor in late 2024 and is widely credited as the architect of the American AI Action Plan — is leaving the administration at the end of June after ~18 months. According to TechCrunch and corroborating reports (WaPo, CNBC, The Information), he plans to launch an independent tech-policy institution after a short break, and will continue to advise the White House externally; he is not returning to a16z’s investment side. A successor has not been named.

The substantive read isn’t the personnel story — it’s the policy execution speed. Krishnan is the most fluent industry-to-administration bridge the current White House had on AI; the Action Plan’s compute-export carve-outs, federal procurement levers, and frontier-model reporting threads ran through him. Expect a wobble — measurable in weeks, not days — on Action Plan implementation timelines until the seat is filled, and watch whether his outside institution becomes the de facto policy shop the administration borrows from. The plain-English “what changed for you today” is: the Action Plan’s deliverables are now a hand-off in motion, not a sustained execution effort.

Google–SpaceX 32-Month, ~$29B GPU Lease — Capacity at xAI’s Colossus, Counterparty is SpaceX

Source: CNBC | TechCrunch

Google has reportedly agreed to pay SpaceX $920M/month for 32 months (Oct 2026 → Jun 2029) — roughly $29.4B total — to lease compute capacity that sits at xAI’s Colossus data centers, comprising about 110,000 NVIDIA GPUs. The contractual counterparty is SpaceX (which operates the underlying capacity), not xAI directly; Google framed it as “bridge capacity” for Gemini Enterprise demand. This sits adjacent to Anthropic’s prior full lease of Colossus 1 capacity, covered in 2026-05-08-AI-Digest — Google and Anthropic are now both renting serving capacity from a Musk vehicle to feed competing frontier-model demand.
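
A quick back-of-envelope check on the reported figures. The monthly fee, term, and GPU count are the reported numbers; the per-GPU breakdown is a derived illustration that assumes the whole fee is attributable to the ~110,000 GPUs and an average of 730 hours per month, neither of which the reporting confirms.

```python
MONTHLY_FEE_USD = 920_000_000   # reported: $920M per month
TERM_MONTHS = 32                # reported: 32-month term
GPU_COUNT = 110_000             # reported: about 110,000 NVIDIA GPUs
HOURS_PER_MONTH = 730           # assumption: average hours in a month

total_usd = MONTHLY_FEE_USD * TERM_MONTHS
per_gpu_month = MONTHLY_FEE_USD / GPU_COUNT
per_gpu_hour = per_gpu_month / HOURS_PER_MONTH

print(f"total over term: ${total_usd / 1e9:.2f}B")       # 29.44B
print(f"implied per GPU-month: ${per_gpu_month:,.0f}")
print(f"implied per GPU-hour: ${per_gpu_hour:.2f}")
```

The implied per-GPU rate bundles power, networking, facilities, and operations, so it should not be compared directly with bare GPU rental prices.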

Structure vs. headline

The accurate framing is “cross-stack compute leasing is now a routine structure” (Microsoft has leased the abandoned Texas Oracle/OpenAI site; OpenAI rents from CoreWeave for ~$22.4B; Anthropic rents from SpaceX). The novelty here is the counterparty, not the structure — Google contracting with a Musk-controlled landlord that runs xAI’s training cluster is the line item worth tracking the week before SpaceX’s reported IPO window. Read this less as “Google is compute-constrained vs xAI” and more as “spare Colossus capacity has now become a salable serving-side product.”

DoubleLine and Oaktree Position for an AI-Capex Credit Downturn — Joining a Lengthening List

Source: Bloomberg

Two of the largest US credit managers — DoubleLine and Oaktree — are publicly positioning their books for a scenario where the AI-capex boom turns into a credit downturn, citing data-center overbuild risk and long-dated bonds funding gear that will be obsolete well inside the maturity schedule. DoubleLine portfolio manager Robert Cohen told Bloomberg that bond valuations aren’t yet frothy but “will undoubtedly” reach those levels, and put a “maybe 100%” probability on an AI-driven credit bubble forming in the future. The actual positioning is defensive credit selection — buying instruments structured to survive a downturn — not CDS or outright shorts; no fund-level $-amount is disclosed.

The right calibration is breadth, not first-mover. PIMCO has been publishing on AI-credit risk for months (the $27B Meta Hyperion deal, the $14B Oracle / Stargate talks, the firm’s “AI Credit Expansion” notes), and Apollo’s $3.5B SpaceX-Valor unitranche from February showed structured AI-infra positioning already in motion. DoubleLine and Oaktree publicly joining the list this week is the n-th data point — what’s signal-worthy is that the breadth of named credit managers now on the record about AI-infra overbuild is the largest it has been, not that DoubleLine and Oaktree are first-movers.

OpenAI Ships ChatGPT Memory “Dreaming V3” — 41.5% → 67.9% → 82.8% Factual Recall, Free Users Get Memory

Source: OpenAI | Implicator

OpenAI rolled out ChatGPT memory “Dreaming V3” — an asynchronous background process that synthesises and revises memories across conversations without explicit user instruction (e.g. an old “going to Singapore in July” note auto-rewrites to “went to Singapore in July 2026” once the trip date passes). OpenAI’s published numbers put factual recall on its internal eval at 41.5% (2024) → 67.9% (2025) → 82.8% (Dreaming V3) — a three-point series with no methodology published — paired with a claimed ~5× compute reduction that unlocks memory for Free users for the first time. US Plus/Pro rollout began June 4 and is now expanding.

The practitioner-relevant read is the architectural pattern, not the recall number. Dreaming V3 is the first production deployment of “sleep-time compute” on memory at consumer scale — an offline reconciliation job that runs over the chat log between sessions, rewriting derived memory state when new evidence arrives. That pattern is directly portable to anyone building an agentic memory layer; the “memory updates happen during dreaming, not during the user turn” decoupling is the load-bearing design choice. Treat the 82.8% as OpenAI’s internal eval (no third-party comparison yet) rather than a settled benchmark.
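
The decoupling described above (memory is rewritten by an offline job, not during the user turn) can be sketched in a few lines. This is a toy under stated assumptions, not OpenAI's design: the `Memory` fields and the `reconcile` function are hypothetical, and a real system would presumably use a model to produce the rewrite instead of a pre-stored `past_text`.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Memory:
    text: str                          # wording surfaced to the assistant
    event_date: Optional[date] = None  # when the planned event ends, if known
    past_text: Optional[str] = None    # wording to use once event_date has passed


def reconcile(memories: List[Memory], today: date) -> List[Memory]:
    """Offline pass: rewrite plans whose date has passed into past-tense facts.

    Intended to run between sessions, so the user turn only reads memory state.
    """
    updated = []
    for m in memories:
        if m.event_date is not None and m.past_text is not None and m.event_date < today:
            updated.append(Memory(text=m.past_text))  # the plan has become history
        else:
            updated.append(m)                         # nothing new to reconcile
    return updated


store = [
    Memory(
        text="going to Singapore in July",
        event_date=date(2026, 7, 31),
        past_text="went to Singapore in July 2026",
    )
]

before_trip = reconcile(store, today=date(2026, 6, 7))
after_trip = reconcile(store, today=date(2026, 8, 1))
print(before_trip[0].text)
print(after_trip[0].text)
```

Because `reconcile` returns a new list and leaves its input untouched, the job can be re-run or rolled back safely, which matters when the rewrite step is a fallible model call.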


🧭 Key Takeaways

  • Recursive self-improvement is a vocabulary now spanning labs, US policy, and independent labs in the same week. Anthropic’s “When AI builds itself” post lands the >80% Claude-merged / 8× engineer throughput numbers as the first-party datum; Sen. Jim Banks puts RSI on the record as a national-security threshold; Sakana AI stands up a dedicated RSI Lab around the thesis that RSI can substitute for hyperscaler capex. Three independent vectors converging is the load-bearing signal — not any one of them on its own.
  • The 80% Claude-merged number is the ceiling under ideal dogfooding conditions, not your enterprise baseline. Anthropic’s own repo, Anthropic’s own engineers, Anthropic’s own tools. The number worth carrying into your own planning is “what fraction of merge volume can the agent draft under review,” not the 80%. Still the strongest first-party data point yet on how a frontier lab’s dev loop has been reshaped.
  • The frontier-lab capital cycle from 2026-06-06-AI-Digest is now showing downstream signals — credit hedging, cross-stack compute leasing — not new headline raises. DoubleLine/Oaktree positioning joins PIMCO and Apollo on the public record; Google–SpaceX–Colossus is the n-th node in the cross-stack-leasing pattern. The thesis is continuation, not a new top-line.
  • Frontier labs are publishing the same diagnosis on the harness vs the model. OpenAI’s “Harness engineering” post on HN today and Anthropic’s RSI post both name the scaffolding around the model — tool use, planning, validation loops — as the dominant lever now that base-model quality has compressed. The practitioner takeaway: harness investment compounds, model swaps don’t.
  • The public Aider polyglot board has been frozen for six months and is now functioning as a reference floor, not a today-signal. Unevaluated June 2026 open-weight releases (DeepSeek V4-Pro, MiniMax M3, Qwen3-Coder-Next) and DeepSeek-V3.2-Exp’s ~74% on a community-maintained Aider mirror are the live edge — the public top-5 will keep showing gpt-5 (high) at 88.0% until the page refreshes.

Generated on 2026-06-07 by Claude