AI Digest — May 18, 2026

Apple's standalone Siri app — partly powered by Gemini through Private Cloud Compute — formalises the post-OpenAI integration era as the Musk v. Altman jury begins deliberations.

Your daily deep-dive on AI models, tools, research, and developer ecosystem news.


🔖 Project Releases

Claude Code

No new release today. v2.1.143 remains current (covered in 2026-05-15-AI-Digest; release notes 2026-05-15 22:28 UTC). It brought worktree.bgIsolation: "none" for repos where worktrees are impractical, plugin dependency enforcement on enable/disable, eight new claude agents dispatch flags (--add-dir, --settings, --mcp-config, --plugin-dir, --permission-mode, --model, --effort, --dangerously-skip-permissions), and per-turn projected context cost in the /plugin browse pane. Now tombstoned across three consecutive digests.

Beads

No new release this week. v1.0.4 (2026-05-09) remains current. It brought Linear OAuth client-credentials authentication, batch issueBatchCreate/issueBatchUpdate for ~50x efficiency, idempotency markers for duplicate-create prevention, a -C <path> flag for running commands from a different directory, and --reason-file mirroring --body-file for close operations. Now tombstoned across five consecutive digests.

OpenSpec

No new release this week. v1.3.1 (2026-04-21, “Path & Telemetry Fixes”) remains current. It brought realpath resolution for symlinked and case-insensitive paths, correct glob expansion in apply instructions, requirement detection inside fenced code blocks, and PostHog telemetry timeout handling for firewalled networks. Tombstoned again, carried over from 2026-05-17-AI-Digest.


🧵 From the Community (r/LocalLLaMA & r/MachineLearning)

Four-way local-inference benchmark: M5 vs DGX Spark vs Strix Halo vs RTX 6000

r/LocalLLaMA — “M5 vs DGX Spark vs Strix Halo vs RTX 6000 — real hardware benchmarks”

The author physically co-located all four platforms, ran three days of standardised parallel tests, and published the harness to a repo. Memory bandwidth — the variable that ends most local-inference debates — splits cleanly: RTX 6000 at ~1,800 GB/s, M5 Max at ~546 GB/s, DGX Spark at ~273 GB/s (per vendor specs; the post quotes slightly different round numbers). For anyone choosing where to run quantised 27B–70B models, this is the most rigorous apples-to-apples comparison published this week.
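Why bandwidth dominates: during single-stream decoding, a dense model has to stream roughly all of its weights from memory for every generated token, so bandwidth divided by model size gives a rough ceiling on tokens per second. Here is a minimal back-of-envelope sketch in Python using only the three bandwidth figures quoted above (no Strix Halo figure is quoted here, so it is left out). The 4-bit quantisation level and the function names are illustrative assumptions, and real throughput will land below these ceilings once KV-cache reads and compute overheads are counted.

```python
# Rough ceiling on single-stream decode speed for a bandwidth-bound dense model.
# Assumption: every generated token streams all weights from memory exactly once.

# Memory bandwidth in GB/s, as quoted in the digest (vendor specs).
BANDWIDTH_GBPS = {
    "RTX 6000": 1800.0,
    "M5 Max": 546.0,
    "DGX Spark": 273.0,
}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a quantised dense model."""
    return params_billion * bits_per_weight / 8.0


def max_tokens_per_second(bandwidth_gbps: float, model_gb: float) -> float:
    """Upper bound on decode tokens/s if memory bandwidth is the only limit."""
    return bandwidth_gbps / model_gb


if __name__ == "__main__":
    for params in (27.0, 70.0):
        size = weights_gb(params, bits_per_weight=4.0)  # 4-bit is an assumption
        for name, bandwidth in BANDWIDTH_GBPS.items():
            ceiling = max_tokens_per_second(bandwidth, size)
            print(f"{params:.0f}B @ 4-bit on {name}: <= {ceiling:.1f} tok/s")
```

The useful part is the ratio rather than the absolute numbers: on this model the RTX 6000's ceiling is about 6.6x the DGX Spark's at any model size, which is why measured benchmarks like the one in the post matter for seeing how close each platform gets to its own ceiling.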

85 GPU-hours on abliteration forensics for Qwen3.6-27B

r/LocalLLaMA — “85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B — Abliterlitics”

Abliteration, the family of weight-level techniques for removing refusal behaviour from instruction-tuned models, finally has an open-source forensics toolkit. The author compared five variants against the base model on benchmark scores, safety evaluations, distribution shift, and weight-level diffs. The post is long (13,624 characters) and its methodology section is unusually careful; if you are fine-tuning or deploying uncensored local models, this is the first quantitative guide we have seen on which abliteration variant degrades capability least.
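For readers new to the technique, one commonly described form of abliteration is directional ablation: estimate a "refusal direction" r in the model's activations, then project it out of the weight matrices that write into that space, W' = W - r rᵀ W. The sketch below shows only that projection step, in plain Python on a toy matrix. It is an illustration under stated assumptions, not any of the five methods the post compares; the matrix, the direction vector, and the function names are all made up for the example.

```python
import math


def normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(weight, direction):
    """Return W' = W - r r^T W for unit vector r.

    weight is a list of rows (d_out x d_in); direction has length d_out.
    After ablation the layer's output has no component along r.
    """
    r = normalise(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


def output_along(weight, direction, x):
    """Component of the layer output W x along the (normalised) direction."""
    r = normalise(direction)
    y = [sum(row[j] * x[j] for j in range(len(x))) for row in weight]
    return sum(r[i] * y[i] for i in range(len(y)))


if __name__ == "__main__":
    toy_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3x2 layer
    toy_direction = [1.0, 0.0, 1.0]                     # hypothetical direction
    ablated = ablate_direction(toy_weight, toy_direction)
    x = [0.7, -1.3]
    print("before:", output_along(toy_weight, toy_direction, x))
    print("after: ", output_along(ablated, toy_direction, x))
```

The "after" value is zero up to floating-point error for any input, which is the whole point of the projection. Everything the post measures (capability loss, distribution shift, weight diffs) is about the side effects of edits like this one on the directions that were not targeted.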

llama.cpp PR #23198 lands MTP prompt-decode optimisation

r/LocalLLaMA — “llama.cpp avoid copying logits during MTP prompt decode”

A merged llama.cpp pull request eliminates a logit-copy step during multi-token-prediction (MTP) prompt processing. The thread itself is one sentence (“time to update your llama.cpp → improved prompt processing speed”), but the PR is merged and the practical implication is direct: anyone running MTP-enabled models (Qwen3.6 with draft heads, for example) should pull and rebuild.

ROCm 7.13 nightly adds Strix Halo optimisations

r/LocalLLaMA — “ROCm 7.13 nightly adds Strix Halo optimizations”

The tech-preview build (TheRock on GitHub) ships new kernel optimisations for the Ryzen AI Max 300 “Strix Halo” APU and open-sources the ROCprof Trace Decoder. Read alongside the four-way hardware benchmark above: AMD’s APU is the cheapest serious contender in that comparison, and the ROCm gap has been the main reason it underperforms its memory-bandwidth promise.


📰 Technical News & Releases

Apple’s Standalone Siri App Will Auto-Delete Chats, Routes Gemini Through Private Cloud Compute

Source: Bloomberg | TechCrunch

Apple’s standalone Siri app in iOS 27 will ship with auto-deleting conversation history, with retention options of 30 days, one year, or indefinite, and is expected to debut as a public beta at WWDC in June. Mark Gurman’s reporting for Bloomberg, which TechCrunch corroborated independently the same day, says the underlying model is Google Gemini, but Gemini queries are routed through Apple’s Private Cloud Compute rather than going directly to Google’s infrastructure.

Read this as Apple making privacy structural, not optional

Apple’s competitors offer privacy as opt-in settings: ChatGPT has a Memory-off mode, Claude does not train on paid/API prompts by default, and both expose deletion controls. Apple’s pitch is that the same restrictions live at the OS layer with auto-delete defaulting on. The iOS 27 Extensions framework, which routes some queries to OpenAI, Anthropic, and Google directly, partially undermines that framing, but the default path is the one most users will be on.

This is also the cleanest signal yet that the OpenAI–Apple integration the two companies announced in 2024 is no longer the centrepiece. The January 2026 Apple–Google Siri deal was the inflection point; today’s reporting confirms the new Siri’s identity is being built around Gemini-plus-privacy, with ChatGPT integration retained but secondary.

Musk v. Altman Closing Arguments Wrap; Jury Begins Deliberations Today

Source: TechCrunch

Closing arguments in the Oakland trial concluded on May 14. As of this morning, the nine-member jury (six women, three men) begins deliberations before Judge Yvonne Gonzalez Rogers, who retains final authority: the jury’s verdict on the misappropriation and breach claims is advisory. The trust question is the deciding one: whether the jury believes Sam Altman’s account of OpenAI’s shift from a nonprofit to a for-profit entity valued at $852B, or accepts Elon Musk’s argument that the roughly $38M he personally donated to the nonprofit formed a charitable trust requiring perpetual nonprofit status.

Watch the valuation cited

Press shorthand of “$850 billion for-profit giant” is accurate but flattens the structure: OpenAI’s $852B post-money valuation was set in a $122B primary round in March 2026 co-led by SoftBank, a16z, Amazon, and Nvidia. The trust-misuse claim is about the original ~$38M; the valuation just sharpens the optics.

China’s Energy Capacity Build-Out Becomes the AI Compute Story

Source: Bloomberg | Bloomberg

Two Bloomberg pieces this weekend pulled the same thread from opposite ends. The May 17 video frames energy capacity — not chips or software — as the new front in the US–China AI race, citing 429 GW of net new generation capacity added in China in 2024 versus roughly 51 GW in the US (the 2024 numbers are total all-source additions, dominated by solar and wind). Hank Paulson’s commentary anchors the framing: US electricity shortages could become a hard ceiling on AI ambitions even if the US retains a model-quality lead.

The May 16 piece operationalises the same thesis from inside the grid. Three large data-center clusters — China Unicom’s Shaoguan campus and China Mobile’s Guangzhou and Zhanjiang data centers, all majority state-owned — entered Guangdong’s electricity spot market on May 14 via a provincial virtual-power-plant platform, becoming the first Chinese data centers to buy at real-time prices. That is the structural shift: AI compute demand becoming an active participant in grid balancing rather than a passive load.

Energy is a medium-term constraint, not the binding one today

The binding constraint right now is chips and interconnect, not megawatts — China’s domestic AI-chip output remains a small fraction of US capacity and H20 export restrictions are still active. The Paulson framing reads correctly as “the lead China is building if chip gaps narrow,” not “the constraint biting today.”

The Detroit Three Enter the AI Skills Swap

Source: TechCrunch

TechCrunch’s Mobility newsletter on May 17 framed General Motors’ recent cut — more than 10% of its IT workforce, roughly 600 salaried employees in Austin and Warren — as the leading edge of an industry pattern. The data backs that read for Detroit specifically: GM, Ford, and Stellantis have collectively shed 20,000+ white-collar roles (around 19% of peak workforce) while posting nearly 400 AI-related openings among 2,000+ combined hires. GM has approximately 80 AI-focused positions open against the 600 cut; this is a skills swap, not a 1:1 headcount replacement.
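The "skills swap, not a 1:1 replacement" read follows directly from the ratios. A quick check in Python, using only the figures quoted above; "nearly 400" and "2,000+" are treated as round numbers here, so the outputs are approximate, and the variable and function names are just for illustration.

```python
# Figures as quoted in the digest; rounded inputs, so results are approximate.
GM_ROLES_CUT = 600
GM_AI_OPENINGS = 80
DETROIT_AI_OPENINGS = 400       # "nearly 400"
DETROIT_COMBINED_HIRES = 2000   # "2,000+"
DETROIT_ROLES_SHED = 20000      # "20,000+"
SHED_SHARE_OF_PEAK = 0.19       # "around 19% of peak workforce"


def share(part: float, whole: float) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


def implied_peak(shed: float, share_of_peak: float) -> float:
    """Peak headcount implied by a cut size and its share of peak."""
    return shed / share_of_peak


if __name__ == "__main__":
    print(f"GM AI openings per role cut: {share(GM_AI_OPENINGS, GM_ROLES_CUT):.1%}")
    print(f"AI share of Detroit hires:   {share(DETROIT_AI_OPENINGS, DETROIT_COMBINED_HIRES):.1%}")
    print(f"Implied peak white-collar headcount: {implied_peak(DETROIT_ROLES_SHED, SHED_SHARE_OF_PEAK):,.0f}")
```

Roughly one AI opening for every seven or eight roles cut at GM, and about one in five new Detroit hires being AI-related, is a reallocation of skills on a shrinking base rather than a headcount swap.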

The pattern is Detroit-specific, not global automotive

Toyota grew its US white-collar headcount 31% from 2020–2025 and the European majors have not announced comparable restructurings. Read this as the Detroit Three repricing their software stack, not “AI is coming for automotive” as a global thesis.

Recursive Superintelligence Exits Stealth With $650M at $4.65B Valuation

Source: TechCrunch

Richard Socher’s new venture launched publicly on May 14 with $650M raised in a single equity round at a $4.65B post-money valuation, led by GV and Greycroft with Nvidia and AMD Ventures participating. The pitch is AI systems that autonomously identify their own architectural weaknesses and redesign themselves — distinct from RLHF fine-tuning loops or weight-only self-distillation — with the team citing Socher’s prior Salesforce Einstein and NLP research as foundational.

The framing is novel as commercial positioning, but the underlying techniques sit in continuity with a decade of meta-learning, NAS, and AutoML work — and Google’s AlphaEvolve announcements last year already pushed the engineering-milestone framing for self-improvement. The real test is whether the staged 2026 roadmap produces something demonstrably outside that lineage.

SOOHAK Benchmark Exposes the Unsolvable-Problem Blind Spot

Source: The Decoder | arXiv 2605.09063

A new benchmark from CMU, EleutherAI, and Seoul National University — built with IMO-medalist input — intentionally seeds the problem set with unsolvable problems. On solvable items the leaderboard reads as expected: Gemini 3 Pro ~30%, GPT-5 ~26%, Claude Opus 4.5 ~10% (Avg@3). The diagnostic is the refusal axis: no model clears 50% on recognising that a problem is flawed or unsolvable (best score ~49%), and the paper finds scaling does not move the refusal number meaningfully.

The implication for practitioners is the methodology, not the leaderboard. Confident wrong answers on unsolvable problems are the failure mode that aggregate-accuracy benchmarks systematically hide. SOOHAK is the first published benchmark we’ve seen that turns refusal quality into a first-class metric rather than a footnote.
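To make the two axes concrete, here is a minimal scoring sketch in Python that reports accuracy on solvable items separately from the rate at which a model flags unsolvable ones. It is an assumption-laden illustration, not the paper's evaluation code: the Attempt and Item records, the "flagged" field, and the reading of Avg@3 as mean per-item accuracy over three attempts are all ours.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    correct: bool  # final answer matched the reference (meaningful if solvable)
    flagged: bool  # model said the problem is flawed or unsolvable


@dataclass
class Item:
    solvable: bool
    attempts: List[Attempt]


def avg_at_k(items: List[Item]) -> float:
    """Mean per-item fraction of correct attempts, over solvable items only."""
    solvable = [it for it in items if it.solvable]
    if not solvable:
        return 0.0
    per_item = [sum(a.correct for a in it.attempts) / len(it.attempts)
                for it in solvable]
    return sum(per_item) / len(per_item)


def refusal_rate(items: List[Item]) -> float:
    """Mean per-item fraction of attempts that flagged an unsolvable item."""
    unsolvable = [it for it in items if not it.solvable]
    if not unsolvable:
        return 0.0
    per_item = [sum(a.flagged for a in it.attempts) / len(it.attempts)
                for it in unsolvable]
    return sum(per_item) / len(per_item)


if __name__ == "__main__":
    # Hypothetical results: two solvable and two unsolvable items, 3 attempts each.
    items = [
        Item(True, [Attempt(True, False), Attempt(False, False), Attempt(False, False)]),
        Item(True, [Attempt(False, False)] * 3),
        Item(False, [Attempt(False, True), Attempt(False, False), Attempt(False, False)]),
        Item(False, [Attempt(False, True), Attempt(False, True), Attempt(False, False)]),
    ]
    print(f"Avg@3 on solvable items:    {avg_at_k(items):.3f}")
    print(f"Refusal rate on unsolvable: {refusal_rate(items):.3f}")
```

Keeping the two numbers separate is the design choice that matters: folding unsolvable items into a single accuracy figure would let a model that always answers confidently look no worse than one that knows when to stop.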

OP-Mix Unifies Pretrain, Continual, and Instruction-Tune Data Mixing

Source: arXiv 2605.15220

Hu, Gandhi, Kyunghyun Cho, Linzen, and Sharma introduce OP-Mix, a single data-mixing algorithm using low-rank-adapter interpolation that covers pretraining, continual learning, and instruction tuning. The paper reports a 6.3% average perplexity improvement over training without mixing, with 66% less compute than retraining from scratch and 95% less than on-policy distillation.

The practitioner read: data-mixing pipelines have historically required separate proxy models and search procedures for each phase. OP-Mix collapses that into one method. If the result reproduces, it is the kind of efficiency win that compounds across every training run.


🧭 Key Takeaways

  • Apple’s privacy pitch for Siri is structural, not marketing — but the iOS 27 Extensions framework partially routes around it. Auto-delete defaulting on at the OS layer is a real architectural differentiator versus ChatGPT/Claude’s opt-in controls. The caveat: the Extensions model still pipes queries to OpenAI, Anthropic, and Google directly when invoked.
  • The China energy story is the lead being built, not the constraint biting now. Watch the gap: 429 GW added in 2024 versus ~51 GW for the US is real, and Guangdong’s data-center spot-market entry shows the operational machinery coming online. Today’s binding constraint is still chips and interconnect.
  • Musk v. Altman jury deliberation began this morning. The verdict is advisory; Gonzalez Rogers holds final authority on liability. Whatever the outcome, the trial has anchored OpenAI’s $852B valuation and its nonprofit-to-for-profit history into the public record at a level of detail that earlier reporting never reached.
  • The Detroit Three are repricing their software stack, but the “AI is coming for automotive” framing is overbroad. Toyota’s 31% US white-collar growth (2020–2025) and the absence of comparable European restructurings cut against the global-pattern claim. The actual story is GM, Ford, and Stellantis.
  • SOOHAK names the failure mode that aggregate benchmarks have been hiding. Confident wrong answers on intentionally unsolvable problems are the practitioner-actionable axis; scaling does not fix it.
  • OP-Mix is the most directly useful single paper of the week if it reproduces. One method across pretrain, continual, and instruction-tune phases with 66–95% compute savings is the kind of efficiency claim that warrants a careful replication before you build it into a pipeline.

Generated on 2026-05-18 by Claude