Simon Willison

Overview

Simon Willison is a long-running practitioner-voice blogger whose daily LLM digest at simonwillison.net the AI Digest tracks as a primary-source stream alongside Andrej Karpathy’s X/YouTube output. Willison’s posts tend to crystallise practitioner consensus a beat ahead of mainstream coverage — the “best model crown changed hands five times in six months” framing, the “Claws” product-category naming, and the “coding agents have crossed the daily-driver reliability bar” claim all originated as Willison observations before being picked up elsewhere.

Timeline

2026-05-02-AI-Digest — Willison publishes an end-to-end iNaturalist sightings explorer written entirely on a phone via Claude Code for web; the “build it in an afternoon, on a phone, while waiting” framing is the corpus’s load-bearing data point that the individual-developer productivity ceiling has moved further than headline model-capability releases suggest.
2026-05-10-AI-Digest — Willison amplifies Thariq Shihipar’s argument that asking Claude to emit HTML — not Markdown — unlocks SVG diagrams, interactive widgets, in-page navigation, and other rendering the Markdown surface cannot carry. Developer-tooling-affordance discovery, not a new model capability.
2026-05-11-AI-Digest — Willison publishes a piece arguing “vibe coding” and “agentic engineering” are converging on the same practice; the gap between casual prototypers and professional agentic engineers is narrowing faster than either community acknowledges.
2026-05-19-AI-Digest — The annotated slides from Willison’s PyCon US 2026 lightning talk are published: the “best model crown changed hands five times” framing across Anthropic, OpenAI, and Google in six months (with Willison’s own “depending mostly on vibes” hedge), with Claude Opus 4.5 holding the crown longest; coding agents moved from “often-work to mostly-work”; the “Claws” category has consolidated; and Chinese open-weights models (GLM-5.1, Qwen 3.6-35B-A3B) are “wildly outperforming expectations” on laptop-local inference.
2026-05-20-AI-Digest — Willison publishes the annotated slides from his PyCon US 2026 lightning talk as a five-minute compressed retrospective covering Nov 2025 through May 2026; the corpus is going to lean on this synthesis for the next several weeks. Two load-bearing claims: coding agents have crossed the “daily-driver reliability” bar via late-2025 RL work, and ~20GB open-weight models running locally on laptops now compete with proprietary frontier models on practical workloads (GLM-5.1 and Qwen 3.6-35B-A3B at 20.9GB quantised are his cited reference points). The “best-model crown changed hands five times in six months” framing extends into the new post unchanged.
2026-06-09-AI-Digest — Willison’s WWDC 2026 write-up flags two practitioner-relevant details the keynote framing under-sold: vision LLMs in Apple’s new architecture may finally let Siri operate apps without per-app developer integration (computer-use-style screen reading rather than the App Intents glue Apple has spent two years pushing), and the new Core AI library opens on-device hardware to developer-owned models with PyTorch integration. Paired with AI Weekly’s iOS 27 AI Extensions framing, the corpus’s corrected frame is that Gemini becomes Siri’s default backbone while iOS 27 Extensions keeps Apple multi-sourced at the user layer. Willison’s framing is his own read, not established consensus, but as a directional signal on where on-device agentic stacks are headed, it’s the right read.
2026-06-10-AI-Digest — Willison’s ~5.5-hour hands-on with Claude Fable 5 (posted June 9, Anthropic’s launch day) calls Fable 5 “something of a beast” with notably broader knowledge than predecessors, and lands the $10/$50 per M tokens pricing point (≈ 2× Opus 4.8) as the real positioning. Same day, Willison mirrors Jon Ready’s widely circulated post arguing the Fable 5 terms permit silent degradation of help on competitor apps without notifying users, turning the capability story into a trust-and-alignment thread within hours of launch (1968 / 1525 on HN for the Anthropic page; 649 / 316 for the Ready post). Paired with Andrej Karpathy’s same-day Jevons read, the practitioner vibe-check has consolidated around “real capability bump” within 24h, but field productivity studies have historically come in well below the launch-day “10X” framing.
2026-06-12-AI-Digest — Willison’s two-post hands-on of Claude Fable 5 is the cleanest independent practitioner read so far. June 9 first-impressions post: knowledge breadth and coding “feel big” (Willison shipped llm 0.32a3 mostly via Fable, including a CPython-WASM sandbox wheel he had not previously built), but he flags the model as slow and expensive (one Datasette Agent session burned $99.26 / 78.2M tokens, 89.9% of his daily token spend). June 11 follow-up sharpens the critique: Fable 5’s default posture is to volunteer follow-up actions the user did not ask for (“relentlessly proactive”), which is useful in interactive agent loops but adds friction in disciplined CLI/scripted use. The framing is an individual practitioner observation, not yet a corroborated cross-user pattern. Paired with Anthropic’s same-day apology for the undisclosed distillation-defence guardrail, the day-3 Fable picture is capable, expensive, and over-shipped on autonomy by default.
2026-06-17-AI-Digest — Willison’s June 16 post on the Bloomberg-published Lutnick letter is the practitioner anchor of today’s lead Technical News story alongside the Bloomberg piece itself. The post elevates Kate Moussouris’s open letter making the defender-side counter-argument: defenders use frontier models for everyday “fix the bugs in a file, explain why the fix matters, write tests that confirm the patch works” loops that the directive’s foreign-national restriction blocks wholesale even when no offensive use is in scope. Today’s body flags that the Willison/Moussouris framing conflates the trigger-prompt question (was a defensive request the proximate cause?) with the export-control question (the directive restricts access regardless of prompt intent) — both real, not the same argument.
2026-06-26-AI-Digest — Willison amplifies Bruce Schneier’s “AI as the deployer’s agent” liability frame — the legal position that AI agents should be treated as agents of the person or organization that deploys them, not as third-party tools the deployer can disclaim and not as autonomous actors with their own legal personality. Schneier’s argument is normative (“companies should be as liable for AI-generated mistakes as for human-employee ones, otherwise we incentivize cheaper-but-worse automation with no accountability”); Willison’s amplification adds the developer-facing implication that indemnification clauses in model-vendor contracts will get pulled into the liability question as deployers start arguing their AI-vendor is the upstream responsible party. The framing the digest is careful to hold: this is two voices (one cryptographer-policy commentator, one developer-blogger) making a normative argument, not an emerging legal consensus — no court ruling, regulation, or industry policy statement has adopted the frame yet. The 60-day watch item is whether a parallel argument shows up in an EU AI Act enforcement action, a U.S. tort filing against a deployer, or model-vendor TOS revisions.
2026-06-27-AI-Digest — Willison surfaces the rest of the GPT-5.6 family that landed alongside GPT-5.6 Sol: Terra at $2.50 / $15 (half the GPT-5.5 price) and Luna as a new cheap tier at $1 / $6. The corpus carries Sol’s $5/$30 base-tier pricing as the headline, but Willison’s pricing read across the whole family is the more durable competitive lever — the per-token pricing tier expansion is the part that touches every API customer regardless of whether they ever cross the Sol gating threshold. Practitioner-voice positioning of the Terra/Luna tiers as the structural news of the OpenAI release, distinct from the Sol-vs-Mythos headline benchmark comparison.
2026-07-01-AI-Digest — Willison ships shot-scraper 1.10 on June 30 with a new shot-scraper video subcommand that records browser interactions from a YAML storyboard via Playwright’s screencast — the stated use case being coding agents attaching polished video proofs to their PRs rather than static screenshots or wall-of-text logs. The storyboard format is committed alongside the code the agent ships, so the demo is reproducible from the same PR that carries the change. The corpus framing the digest carries: a small tooling addition to a well-established scraping utility, but structurally the “video proof-of-work” primitive the agent-review workflow has needed since agent PRs started outpacing what human reviewers can eyeball at scale — parallel to the Claude Code 2.1.x admin-posture buildout on the enterprise side of the same problem.
2026-07-10-AI-Digest — Willison’s independent read on GPT-5.6 Sol GA complicates the “back at the frontier alongside” framing. Sol scores 53.6 on Agents’ Last Exam vs Claude Fable 5’s 40.5, but Willison writes “so far it hasn’t struck me as better than Fable at the kind of complex coding tasks I’ve been using”; SWE-Bench Pro puts Fable at 80% against Sol’s 64.6% (with OpenAI’s response attacking the benchmark’s validity rather than the number). The Aider polyglot top-5 still shows GPT-5 (May 2026) at rank 1 with 88.0%; Sol did not displace it. The corpus framing the digest carries anchors on Willison’s read: “price-and-latency re-entry, not a capability upset”. OpenAI restored the axis where it has always led (pricing surface, tier proliferation, API-consumer breadth) while Anthropic retains the coding-quality lead per independent practitioner test. This extends the practitioner-crown-hands-changing thread by anchoring today’s read as another Willison-first-then-mainstream synthesis: the “not obviously better than Fable at complex coding” line will likely be the durable practitioner reference point for the GPT-5.6 GA.
2026-07-13-AI-Digest — Willison posts a short but crisp piece arguing LLM agents must never be the Directly Responsible Individual (DRI) on a project — grounded in the IBM 1979 training slide (“A computer can never be held accountable, therefore a computer must never make a management decision”). The post threads the point through modern agent tooling: an LLM agent can execute, review, propose, and remind — but the accountability endpoint has to be a person. The framing is a crystallisation of decades-old consensus, not a novel thesis — Willison himself flags the IBM slide as “legendary” and is explicit that he’s restating the principle for the agent era. The reason it lands today is timing: it drops the same week Anthropic ships a Claude Code browser (see today’s Project Releases), Meta launches Muse Spark 1.1 for agentic coding, and Microsoft cleaves Copilot along a commodity-versus-frontier line. Structural read worth carrying: the first framing in the corpus that inverts the accountability question from downstream-of-capability to input-constraint-on-agent-design — agents that can’t be given DRI status can’t be given certain project surfaces at all. Willison’s synthesis-ahead-of-mainstream pattern continues; the post is the practitioner-voice reference point the corpus will lean on for the accountability boundary from here.
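
The GPT-5.6 family pricing in the 2026-06-27 entry can be made concrete with a small arithmetic sketch. It assumes the quoted pairs (for example “$2.50 / $15”) are input and output prices per million tokens, which the entry does not state outright, and the workload size is a hypothetical chosen only for illustration.

```python
# Prices quoted in the 2026-06-27 entry, read here as (input, output) USD
# per million tokens. That reading is an assumption, not stated in the entry.
PRICES_PER_M = {
    "Sol": (5.00, 30.00),
    "Terra": (2.50, 15.00),
    "Luna": (1.00, 6.00),
}


def cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one workload on the given tier."""
    in_price, out_price = PRICES_PER_M[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 200k output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 200_000

for tier in PRICES_PER_M:
    print(f"{tier}: ${cost_usd(tier, INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

Under these assumptions the same workload costs $16.00 on Sol, $8.00 on Terra, and $3.20 on Luna, which is the sense in which the tier expansion touches every API customer regardless of whether they use Sol.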

Key Developments

DRI Post Crystallises Accountability as an Input Constraint on Agent Design (July 13, 2026): Willison’s short, crisp piece, grounded in the IBM 1979 slide (“A computer can never be held accountable, therefore a computer must never make a management decision”), argues LLM agents must never be the DRI on a project. The principle isn’t new, and Willison isn’t claiming it is; the post is the crispest articulation of the accountability boundary the corpus has logged, and it lands the same week Anthropic’s Claude Code browser, Meta’s Muse Spark 1.1, and Microsoft’s Copilot cleave push the practical accountability question live. Structural read: the first framing in the corpus to invert accountability from downstream-of-capability to input-constraint-on-agent-design; agents that can’t be given DRI status can’t be given certain project surfaces at all. 60-day watch: whether the DRI framing shows up in a Fortune-500 rollout memo citing the IBM 1979 principle by name inside a policy document.
“Hasn’t Struck Me as Better Than Fable” as GPT-5.6 GA Reference Point (July 10, 2026): Willison’s independent hands-on with GPT-5.6 Sol GA lands as the load-bearing practitioner read on the day of OpenAI’s biggest launch of the quarter. Sol’s 53.6 vs Fable’s 40.5 on Agents’ Last Exam is the headline number OpenAI shipped; Willison’s “so far it hasn’t struck me as better than Fable at the kind of complex coding tasks I’ve been using” is the counter-frame the corpus is now anchoring the GA reception around. Combined with SWE-Bench Pro (Fable 80% vs Sol 64.6%) and the Aider polyglot freeze (GPT-5 May 2026 still #1 at 88.0%), the disciplined framing to carry is that Anthropic retains the coding-quality lead. This pattern-matches Willison’s earlier synthesis-ahead-of-mainstream role (“best model crown changed hands five times in six months,” “Claws” category naming, “daily-driver reliability”) as the practitioner-voice reference the corpus leans on for durable framing.
“AI as Deployer’s Agent” Liability Frame Surfaces as Practitioner-Voice Anchor (June 26, 2026): Willison’s amplification of Bruce Schneier’s normative argument, that AI agents should be treated as agents of the deployer rather than as third-party tools or autonomous actors, is the most articulate version of the deployer-liability frame to date. Disciplined caveat the corpus carries: two named voices is not consensus; no court ruling, regulator, or industry-policy statement has endorsed the frame yet. The developer-facing implication Willison adds, that model-vendor indemnification clauses get pulled into the liability question, is the practitioner-grade angle, and the watch item is whether a parallel argument surfaces in an EU AI Act enforcement action, a U.S. tort filing, or TOS revisions inside 60 days.
Practitioner Synthesis as Corpus Working Frame: Willison’s “last six months” post is the synthesis the back half of 2026 will be read against — five frontier-crown handovers, coding agents at daily-driver reliability, and 20GB local-laptop models within reach of proprietary frontier are the three currents the corpus is now anchoring to.
Local-Inference Floor Naming: Willison’s reference points (GLM-5.1, Qwen 3.6-35B-A3B at 20.9GB quantised) are what the corpus now uses to describe the “frontier-on-a-laptop” floor. The pelican-on-bicycle SVG benchmark continues to climb as the practitioner-flavored capability marker.
Vibe-Coding / Agentic-Engineering Convergence: Willison’s framing that the two practices are collapsing into one, at different velocity-and-oversight settings, is the conceptual coda to the Airbnb/Snap/Google “% AI-authored code” CEO disclosures, recasting the question from “which tool serious engineers use” to “which tool serves the full velocity spectrum.”