GPT-5

Overview

GPT-5 is OpenAI’s frontier model, appearing in benchmark results cited in mid-2026. On the SOOHAK benchmark (CMU/EleutherAI/Seoul National University), GPT-5 scores approximately 26% on solvable math problems (Avg@3), placing it second behind Gemini 3 Pro (~30%) and above Claude Opus 4.5 (~10%). No model, including GPT-5, clears 50% on recognizing unsolvable problems; the best score is ~49%.

Timeline

2026-05-18-AI-Digest — Cited in SOOHAK benchmark results: GPT-5 scores ~26% (Avg@3) on solvable IMO-medalist-validated math problems, placing second behind Gemini 3 Pro (~30%) and above Claude Opus 4.5 (~10%). On the refusal axis (recognizing intentionally unsolvable problems), no model including GPT-5 clears 50% — the paper’s key finding is that scaling does not move the refusal number meaningfully.
2026-05-23-AI-Digest — OpenAI continues to hold four of five slots on the Aider polyglot top-5, three of them GPT-5 variants (gpt-5 high 88.0%, gpt-5 medium 86.7%, o3-pro third at 84.9%, gemini-2.5-pro-preview-06-05 32k think at 83.1%, gpt-5 low 81.3%). Board is identical to yesterday’s snapshot; the durability of the GPT-5 lead despite the absence of any Gemini 3.x entry is itself the data point.
2026-05-26-AI-Digest — OpenAI still holds four of five Aider polyglot top-5 slots at the May 26 fetch, three of them GPT-5 variants (gpt-5 high 88.0%, gpt-5 medium 86.7%, o3-pro 84.9%, gemini-2.5-pro-preview-06-05 32k think 83.1%, gpt-5 low 81.3%). The Aider page footer still reads “last updated November 20, 2025”, so the staleness disclaimer from 2026-05-24-AI-Digest still applies; on this canonical leaderboard the frontier-quality tier remains a four-of-five OpenAI sweep, with gemini-2.5-pro-preview-06-05 holding the only non-OpenAI slot.
2026-05-27-AI-Digest — OpenAI continues the four-of-five sweep on the Aider polyglot top-5, with three GPT-5 variants plus o3-pro (gpt-5 high 88.0%, gpt-5 medium 86.7%, o3-pro 84.9%, gemini-2.5-pro-preview-06-05 32k think 83.1%, gpt-5 low 81.3%); the board is identical to yesterday’s snapshot. The durability of the sweep despite the absence of any Gemini 3.5 Flash entry across multiple consecutive days is the continuing read; the canonical practitioner leaderboard’s frontier-quality tier remains an OpenAI sweep with gemini-2.5-pro-preview-06-05 the only non-OpenAI slot.
2026-05-29-AI-Digest — GPT-5 still tops the Aider polyglot top-5 (fetched 2026-05-29): gpt-5 high 88.0% · gpt-5 medium 86.7% · o3-pro high 84.9% · gemini-2.5-pro-preview-06-05 32k think 83.1% · gpt-5 low 81.3%. The four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro) is unchanged, holding as the comparison anchor against Anthropic’s same-day Claude Opus 4.8 launch and its Terminal-Bench 2.1 claims (Aider remains the canonical practitioner code reference).
2026-05-31-AI-Digest — Aider polyglot top-5 (fetched 2026-05-31): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from yesterday’s snapshot. The four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro), with gemini-2.5-pro-preview-06-05 the only non-OpenAI slot, continues to be the durable read on the canonical practitioner code leaderboard.
2026-06-01-AI-Digest — Aider polyglot top-5 (fetched 2026-06-01): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged for the third straight day. The bench is sitting still; the four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro), with gemini-2.5-pro-preview-06-05 as the only non-OpenAI slot, continues to anchor the canonical practitioner code leaderboard.
2026-06-02-AI-Digest — Aider polyglot top-5 (fetched 2026-06-02): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Same top-5, same percentages, same outlier shape as last week: the bench remains sitting still and the four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro) persists. The same digest references GPT-5.5 (and GPT-5.4) going GA on AWS Bedrock alongside Codex, the first multi-cloud OpenAI frontier-model distribution since Microsoft exclusivity ended.
2026-06-06-AI-Digest — Aider polyglot top-5 (fetched 2026-06-06): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from the prior snapshot: OpenAI retains the four-of-five sweep (three GPT-5 variants plus o3-pro), with gemini-2.5-pro-preview-06-05 the lone non-OpenAI slot. The board functions as today’s calibration anchor for the corpus’s “on-device substrate shifts down a tier without catching the frontier” read on Gemma 4 QAT.
2026-06-07-AI-Digest — Aider polyglot top-5 (fetched 2026-06-07): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Identical to the 2026-06-06-AI-Digest snapshot: the public board has now been frozen at the 2025-11-20 refresh for over six months and is functioning as a reference floor, not a leading indicator. Caveat: several June 2026 open-weight releases (DeepSeek V4-Pro, Qwen3-Coder-Next, MiniMax M3) have not been benchmarked, and at least one community-maintained Aider mirror puts DeepSeek-V3.2-Exp into top-5 at ~74%. The four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro) is the closed-reasoning ceiling against which today’s “open-vs-closed” framing is being calibrated, not today’s live scorecard.
2026-06-09-AI-Digest — Aider polyglot top-5 (fetched 2026-06-09): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from 2026-06-08-AI-Digest for the second day running: the four-of-five OpenAI sweep (three GPT-5 variants plus o3-pro) remains the capability-ceiling reference point against today’s Xiaomi MiMo-v2.5-Pro-UltraSpeed inference-speed-frontier release. Today’s Key Takeaway frames it explicitly: capability ceiling stays GPT-5, cost disruption stays DeepSeek, inference-speed frontier is now Xiaomi.
2026-06-10-AI-Digest — Aider polyglot top-5 (fetched 2026-06-10): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from yesterday’s snapshot. The board functions today as the cross-check against Anthropic’s same-day Claude Fable 5 launch: Anthropic’s release page anchors on SWE-Bench Pro at 80.3% (vs Opus 4.8 69.2% and GPT-5.5 58.6%), but Fable 5 is not yet rated on Aider polyglot, and SWE-Bench Pro is a different benchmark from Aider’s polyglot mix, so cross-board transfer is the open question worth watching. The corpus is now waiting to see whether Fable 5 lands above the gpt-5 (high) 88.0% number or below.
2026-06-11-AI-Digest — Aider polyglot top-5 (fetched 2026-06-11): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Three of five rungs are GPT-5; the capability-ceiling story hasn’t moved this week. Today’s Key Takeaways use this board as the anchor against the DeepSeek cost-curve framing (capability race firmly US-led at the ceiling, with cost-per-token now genuinely separating into a different race with different customers).
2026-06-12-AI-Digest — Aider polyglot top-5 (fetched 2026-06-12): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from yesterday. Today’s Technical News uses GPT-5.5’s $5/M input · $30/M output as the capability-tier anchor for the OpenAI price-war framing: Anthropic’s Fable 5 launched at $10/$50 (≈ 2× GPT-5.5), OpenAI is weighing token-price cuts (Altman calls cost “a huge issue”), and price-per-token + capability now read as coupled axes of a tier rather than separate races.
2026-06-13-AI-Digest — Aider polyglot top-5 (fetched 2026-06-13): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Three of five rungs are GPT-5, but SWE-Bench Verified now has Claude Mythos 5 (95.5%), Claude Fable 5 (95%), and Claude Opus 4.8 (88.6%) sweeping its top three — the two leaderboards are now reliably disagreeing on the coding race. The story today is benchmark divergence, not an “OpenAI coding comeback”: any single-leaderboard read on the coding race is partial.
2026-06-15-AI-Digest — Aider polyglot top-5 (fetched 2026-06-15): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Identical to yesterday’s and Saturday’s row order and percentages — 72 hours frozen. The SWE-Bench Verified top three (Claude Mythos 5 95.5%, Claude Fable 5 95%, Claude Opus 4.8 88.6%) is unchanged in print but Mythos 5 and Fable 5 remain globally disabled, so the published SWE-Bench frontier has been inaccessible to API callers for ~72 hours. Treat any “OpenAI sweeps coding this week” read as an artefact of the disable, not a competitive shift. The corpus’s “Aider vs SWE-Bench divergence” thread stays on pause until either Anthropic reactivates the disabled tier or a fresh tag overtakes.
2026-06-17-AI-Digest — Aider polyglot top-5 (fetched 2026-06-17): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Identical ordering and percentages to last week: OpenAI holds four of five slots (three GPT-5 variants plus o3-pro), Gemini 2.5 Pro holds fourth, no Claude Opus 4.8 entry yet. The corpus callout from today’s [!note]: treat the stability as “no new frontier coding model has cleared the bar this week,” not as a fresh ranking event. The frozen-leaderboard framing is itself the signal in the post-Fable-5 window.
2026-06-18-AI-Digest — Aider polyglot top-5 (fetched 2026-06-18): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Eight days frozen: same ordering, same percentages as last week and as 2026-06-17-AI-Digest; the agentic-coding bar has not moved during the entire Claude Fable 5 / Claude Mythos 5 shutdown window. Today’s GLM-5.2 / Artificial Analysis read sharpens the juxtaposition: on agentic / polyglot coding the closed-source frontier (three GPT-5 variants plus o3-pro plus Gemini 2.5 Pro, no open-weights entry) still leads cleanly, while on the general-intelligence and frontend-coding axes open-weights leadership is real and accelerating. Two coding axes, different leaders.
2026-06-19-AI-Digest — Aider polyglot top-5 (fetched 2026-06-19): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Nine days frozen. GLM 5.2 now durably tops the open-weights distribution but remains absent from the polyglot top-5; two coding axes, different leaderboards. The closed-source frontier still leads cleanly on agentic / polyglot coding while the Fable 5 / Mythos 5 shutdown window stays open. GPT-5 is also the comparison anchor in today’s OpenAI pre-IPO bench-stack story — Shazeer and Dean Ball were hired against a backdrop where the GPT-5 sweep is the durable capability ceiling reference.
2026-06-20-AI-Digest — Aider polyglot top-5 (fetched 2026-06-20): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Ten days frozen: a double-digit streak now, with the same five rows and same percentages as yesterday and the day before. GLM 5.2 continues to top the Artificial Analysis open-weights distribution but does not appear on this leaderboard; two coding axes, different leaderboards. The signal is the persistence of the top-5, not GLM 5.2’s absence from it. Anchors today’s Anthropic–DeepMind–OpenAI talent-flow framing (Jumper to Anthropic, Shazeer to OpenAI) against an unchanged capability-ceiling reference.
2026-06-21-AI-Digest — Aider polyglot top-5 (fetched 2026-06-21): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Eleven days frozen. DeepSeek-V3.2-Exp surfaces at 74.5% on the broader board as the next-closest open-weights entry below the closed top-5 lock; GLM 5.2 still tops the Artificial Analysis open-weights distribution on a different axis. The corpus framing this digest formalises: stop calling it a “streak” and start calling it the state of play, because closed-model dominance on the polyglot axis is not transient. Today’s GPT-5 sweep is the durable capability-ceiling reference anchoring the digest’s “agent-platform layer is forming this weekend across three vendors” key takeaway.
2026-06-23-AI-Digest — Aider polyglot top-5 (fetched 2026-06-23): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day thirteen of the polyglot freeze — same five rows, same percentages as the rolling print since 2026-06-12-AI-Digest. Two new framings today: (1) today’s PlanBench-XL paper documents GPT-5.4 collapsing from 51.90% to 11.36% accuracy under severe tool-blocking on 327 retail tasks across 1,665 tools — concrete evidence that frontier agents collapse when tool environments are imperfect, used in the digest body as counter-evidence to the “loops are dominant” thesis. (2) External reporting that GLM 5.2 claims wins against GPT-5 on SWE-bench Pro and Terminal-Bench 2.1 suggests the frozen-polyglot frame may be eval-specific rather than capability-wide. The corpus continues holding both axes — closed-source frontier still leads the agentic-polyglot bar; open-weights sentiment moves on different evals — without collapsing them.
2026-06-25-AI-Digest — Aider polyglot top-5 (fetched 2026-06-25): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day fifteen of the polyglot freeze — same five rows, same percentages as 2026-06-24-AI-Digest and every print going back to 2026-06-12-AI-Digest. The digest’s [!note] softens the framing: the freeze coincides with a release-cadence lull (no new flagship drop in the window), so it’s at least as consistent with a sampling artifact as a capability plateau — re-test on the next flagship release. Independent leaderboards still show open-weights models cracking rank 5 below the Aider cut (DeepSeek-V3.2-Exp sits in the mid-70s on equivalent polyglot evals) so the closed top-5 lock holds while the broader open-vs-closed gap below it continues to narrow.
2026-06-26-AI-Digest — Aider polyglot top-5 (fetched 2026-06-26): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day sixteen of the polyglot freeze — same five rows, same percentages as 2026-06-25-AI-Digest and every print since 2026-06-12-AI-Digest. The digest holds the softened framing from yesterday: the freeze coincides with a release-cadence lull, so a sampling artifact is at least as consistent with the data as a capability plateau — re-test on the next flagship drop; open-weights continuing to crack rank 5 below the cut is the secondary axis to keep watching.
2026-06-29-AI-Digest — Aider polyglot top-5 (fetched 2026-06-29): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day nineteen of the polyglot freeze — same five rows, same percentages as 2026-06-28-AI-Digest and every print back to 2026-06-12-AI-Digest, extending the longest unbroken freeze the corpus has recorded. The framing the corpus carries: with GPT-5.6 Sol still under the customer-by-customer access regime and Mythos only restored to ~100 trusted partners, Aider cannot realistically sample either of the two highest-altitude tiers — the freeze remains an artifact of gated-access timing, not a benchmark plateau. The secondary axis worth watching is whether GLM 5.2 surfaces on the polyglot leaderboard outside the top-5 cut in the next print, given today’s Semgrep cyber-bench result.
2026-07-02-AI-Digest — Aider polyglot top-5 (fetched 2026-07-02): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day twenty-two of the polyglot freeze — same five rows, same percentages as every print back to 2026-06-12-AI-Digest, the longest unbroken freeze the corpus has recorded, now well into its fourth week. Claude Sonnet 5 shipped inside the window on 2026-06-30-AI-Digest and has not yet posted a polyglot number; the typical Aider-inclusion lag for a frontier release is 1–3 weeks, so day 22 is not yet the definitive test of whether the freeze survives Sonnet 5.
2026-07-05-AI-Digest — Aider polyglot top-5 (fetched 2026-07-05): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day twenty-five of the polyglot freeze: nearly four weeks of the same top-5 stretching back to 2026-06-12-AI-Digest. Neither Claude Sonnet 5 nor the redeployed Claude Fable 5 has posted polyglot numbers, and the GPT-5.6 Sol limited-preview cohort excludes public benchmarking. The corpus reads the freeze as an evaluation gap driven by Aider inclusion lag rather than a capability plateau: OpenAI’s four-of-five sweep (three GPT-5 variants plus o3-pro) is the durable capability-ceiling reference against a live but not-yet-scored frontier tier. The same digest surfaces the community-filed HN thread “GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance” (202 pts / 70 cmts) as a public post-mortem-style signal worth cross-checking against the upcoming Sol Pro / Terra Pro / Luna Pro benchmarks.
2026-07-07-AI-Digest — Aider polyglot top-5 (fetched 2026-07-07): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day twenty-seven of the polyglot freeze: same five rows and same percentages as every print back to 2026-06-12-AI-Digest, so the corpus’s longest recorded unbroken freeze extends by another day. The digest’s [!note] softens yesterday’s “benchmark-saturation” framing: Anthropic has reported Opus 4.5 at 89.4% on polyglot, above the 88.0% top row here, and the benchmark was specifically redesigned to avoid the saturation the Python-only predecessor hit at 80%+. The parsimonious read now is evaluation lag, not benchmark ceiling: GPT-5.6 Sol and Claude Sonnet 5 are both unscored on the public leaderboard, so wait for one of them to land a score before treating the freeze as a saturation artifact.
2026-07-08-AI-Digest — Aider polyglot top-5 (fetched 2026-07-08): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Day twenty-eight of the polyglot freeze: same five rows and same percentages as every print back to 2026-06-12-AI-Digest, the longest recorded unbroken freeze in the corpus. Today’s [!note] reframing holds from yesterday: this is evaluation lag, not a benchmark ceiling. GPT-5.6 Sol and Claude Sonnet 5 remain unscored on the public leaderboard, while Anthropic’s reported Opus 4.5 figure of 89.4% sits above the top row. Wait for one of them to land a public score before treating the freeze as anything else.
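
The slot counts and freeze-day numbers quoted in the timeline can be re-derived from the printed rows. The following is a minimal sketch, not the digest's own tooling: the prefix-to-vendor mapping and the helper names are hypothetical, and the freeze start date of 2026-06-11 is an assumption inferred from the "Eight days frozen" count in the 2026-06-18 entry.

```python
from collections import Counter
from datetime import date

# Top-5 rows as printed in the timeline entries.
TOP5 = [
    ("gpt-5 (high)", 88.0),
    ("gpt-5 (medium)", 86.7),
    ("o3-pro (high)", 84.9),
    ("gemini-2.5-pro-preview-06-05 (32k think)", 83.1),
    ("gpt-5 (low)", 81.3),
]

# Hypothetical mapping from model-name prefix to (family, vendor).
PREFIXES = {
    "gpt-5": ("GPT-5", "OpenAI"),
    "o3": ("o3", "OpenAI"),
    "gemini": ("Gemini", "Google"),
}


def classify(model):
    """Return (family, vendor) for a leaderboard row name."""
    for prefix, labels in PREFIXES.items():
        if model.startswith(prefix):
            return labels
    return ("other", "other")


def slot_counts(rows):
    """Count top-5 slots per model family and per vendor."""
    families = Counter(classify(name)[0] for name, _ in rows)
    vendors = Counter(classify(name)[1] for name, _ in rows)
    return families, vendors


def freeze_day(first_day, snapshot_day):
    """Inclusive calendar-day count: the first frozen day is day 1."""
    return (snapshot_day - first_day).days + 1


# Assumption: day 1 is 2026-06-11, inferred from "Eight days frozen" on 2026-06-18.
FREEZE_START = date(2026, 6, 11)

families, vendors = slot_counts(TOP5)
print(families["GPT-5"], vendors["OpenAI"])          # GPT-5 slots, OpenAI slots
print(freeze_day(FREEZE_START, date(2026, 6, 23)))   # day count for the 06-23 entry
print(freeze_day(FREEZE_START, date(2026, 7, 2)))    # day count for the 07-02 entry
```

Counting by family and by vendor separately keeps "GPT-5 slots" and "OpenAI slots" from being conflated, since o3-pro is an OpenAI model but not a GPT-5 variant.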

Key Developments

SOOHAK Benchmark (May 2026): GPT-5 places second on solvable problems at ~26% Avg@3. More importantly, it fails to clear the 50% refusal threshold, as does every other frontier model. The benchmark’s load-bearing diagnostic is that confident wrong answers on unsolvable problems are the failure mode that aggregate accuracy benchmarks systematically hide.
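
To make the two SOOHAK axes concrete, here is a small sketch of how a solvable-problem score and a refusal score might be computed separately. It assumes Avg@3 means accuracy averaged over three sampled attempts per problem, which the entries here do not define; the function names and the toy data are hypothetical and are not SOOHAK results.

```python
from statistics import mean


def avg_at_k(attempts_per_problem):
    """Mean over problems of the fraction of correct attempts (assumed Avg@k)."""
    return mean(
        mean(1.0 if ok else 0.0 for ok in attempts)
        for attempts in attempts_per_problem
    )


def refusal_rate(refused_flags):
    """Fraction of unsolvable problems the model declined to 'solve'."""
    return mean(1.0 if refused else 0.0 for refused in refused_flags)


# Toy data for illustration only: four solvable problems, three attempts each.
solvable = [
    [True, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, False],
]
# Toy data: whether the model recognised each unsolvable problem as unsolvable.
unsolvable_refused = [True, False, False, False]

print(f"solvable Avg@3: {avg_at_k(solvable):.2f}")
print(f"refusal rate:   {refusal_rate(unsolvable_refused):.2f}")
print(f"confident-wrong rate on unsolvable: {1 - refusal_rate(unsolvable_refused):.2f}")
```

Reporting the two numbers side by side, rather than folding them into one accuracy figure, is what lets the confident-wrong rate on unsolvable problems stay visible.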

2026-07-16-AI-Digest — Aider polyglot top-5 (fetched 2026-07-16) still leads with gpt-5 (high) at 88.0%, gpt-5 (medium) at 86.7%, o3-pro (high) at 84.9%, gemini-2.5-pro-preview-06-05 (32k think) at 83.1%, and gpt-5 (low) at 81.3%. OpenAI holds four of five polyglot slots (three GPT-5 variants plus o3-pro), with GPT-5.6 Sol absent from the board. Also today: GPT-Red’s baseline attack-success rate is cited at 95%+ against GPT-5.1, the pre-hardened Sol predecessor and the specific model GPT-Red was shown to break at scale before OpenAI hardened Sol to <10%. The GPT-5 lineage is now both the defender-side benchmark target and the leaderboard incumbent in the same news cycle.
2026-07-17-AI-Digest — Aider polyglot top-5 (fetched 2026-07-17): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3%. Unchanged from yesterday; OpenAI continues to hold four of five slots (three GPT-5 variants plus o3-pro). The digest carries the disciplined framing that this is a snapshot benchmark materially trailing the open-weight release cycle: Kimi K3, shipping today at Sonnet-tier pricing, is not yet scored. The GPT-5 lineage remains the leaderboard incumbent while the K3 pricing entry pressures the commodity tier the leaderboard doesn’t measure directly.
2026-07-21-AI-Digest — Aider polyglot top-5 (fetched 2026-07-21): 1. gpt-5 (high) — 88.0% · 2. gpt-5 (medium) — 86.7% · 3. o3-pro (high) — 84.9% · 4. gemini-2.5-pro-preview-06-05 (32k think) — 83.1% · 5. gpt-5 (low) — 81.3% — three GPT-5 slots plus one o3-pro is the sharpest instance yet of OpenAI dominance on this specific eval. Digest carries a load-bearing cross-check: SWE-Bench Verified as of July has Claude Mythos 5 at 95.5% and Claude Opus 4.7 holding #1 through June — so the “gpt-5 lock-in” read is Aider-polyglot-specific, not universal. The bench-split is now durable enough that “which benchmark are you optimising for” is a real routing decision, not a rhetorical question. GPT-5’s leaderboard incumbency continues to anchor the closed-frontier ceiling on this eval while Kimi K3 pricing pressures the commodity tier below it.
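
The per-token prices quoted in the 2026-06-12 entry ($5/M input and $30/M output for GPT-5.5, $10/$50 for Claude Fable 5) can be turned into a workload cost comparison. This is an illustrative sketch only: the prices come from that entry, while the token mix and the function names are hypothetical.

```python
# USD per million tokens as (input, output), taken from the 2026-06-12 entry.
PRICES = {
    "GPT-5.5": (5.0, 30.0),
    "Claude Fable 5": (10.0, 50.0),
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost of one workload at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def price_ratio(model_a, model_b, input_tokens, output_tokens):
    """How many times more model_a costs than model_b on the same workload."""
    return cost_usd(model_a, input_tokens, output_tokens) / cost_usd(
        model_b, input_tokens, output_tokens
    )


# Hypothetical workload: 3M input tokens and 1M output tokens.
workload = (3_000_000, 1_000_000)
for model in PRICES:
    print(f"{model}: ${cost_usd(model, *workload):.2f}")

print(f"input-only ratio:  {price_ratio('Claude Fable 5', 'GPT-5.5', 1_000_000, 0):.2f}")
print(f"output-only ratio: {price_ratio('Claude Fable 5', 'GPT-5.5', 0, 1_000_000):.2f}")
print(f"mixed ratio:       {price_ratio('Claude Fable 5', 'GPT-5.5', *workload):.2f}")
```

The price multiple depends on the input/output mix, so the "≈ 2×" figure holds exactly for input tokens and is lower for output-heavy workloads.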
