GLM-5.2 just took the open-weights crown - and it thinks a lot
I have a folder on my laptop called model-sweep-june. It is full of half-finished eval scripts, API keys I forgot to revoke, and terminal logs where some model spent forty thousand tokens deciding whether a Nim function should be called eval or evaluate. I opened it again this morning because HN was yelling about GLM-5.2, and I wanted to see if the hype matched my receipts.
It mostly does. With a caveat I will get to.
The story in one sentence
Z.ai’s GLM-5.2 is now the top-scoring open-weights model on Artificial Analysis’s Intelligence Index v4.1 at 51 points, up eleven from GLM-5.1 at the same 744B-total / 40B-active parameter size, MIT-licensed, with a 1M-token context window and API pricing that still undercuts the frontier labs by an embarrassing margin.
That is the headline. The subtext is that open-weights AI is no longer chasing the proprietary frontier from three counties away. It is sitting at the same table, ordering the same agentic benchmarks, and occasionally leaving with a higher GDPval score than GPT-5.5.
Why this hit the front page
Timing, mostly. Artificial Analysis had just shipped Intelligence Index v4.1 the day before - a benchmark refresh that tilts harder toward agentic workloads, longer trajectories, and real-world knowledge tasks. GLM-5.2 dropped into that new scoreboard and immediately topped the open-weights column.
But HN did not upvote a changelog. It upvoted a shift in the power map.
For years the open-weights conversation was “good enough for fine-tuning” or “great if you self-host on a rack you do not own.” GLM-5.2 is being sold as something else: a model that scores 1524 on GDPval-AA v2, essentially tied with GPT-5.5 (xhigh, 1514), and ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro max (1328). GDPval-AA v2 is Artificial Analysis’s agentic real-world work benchmark - not trivia, not single-turn coding puzzles.
The numbers that matter if you actually run models:
| Metric | GLM-5.2 | GLM-5.1 | MiniMax-M3 | DeepSeek V4 Pro (max) |
|---|---|---|---|---|
| Intelligence Index v4.1 | 51 | 40 | 44 | 44 |
| GDPval-AA v2 | 1524 | - | 1418 | 1328 |
| Output tokens per task | 43k | 26k | 24k | 37k |
| Cost per Intelligence Index task | ~$0.46 | ~$0.25 | ~$0.18 | ~$0.05 |
| Context window | 1M | 200K | - | - |
| License | MIT | - | - | - |
Official API pricing stayed flat at $1.4 / $4.4 / $0.26 per million input / output / cache-hit tokens. Same bill as 5.1, more brain per dollar - at least on paper.
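To put "at least on paper" into numbers, here is a rough Python sketch that uses only figures from the table and the pricing line above. The function names and the "index points per dollar" ratio are illustrative assumptions, not an Artificial Analysis metric. The output-only cost ignores input and cache-hit tokens, which the numbers above do not break out, so it is a lower bound and will not match the per-task cost in the table.

```python
# Official Z.ai output price quoted in the article, in USD per million tokens.
OUTPUT_PRICE_PER_M = 4.4

# Values copied from the table above:
# (Intelligence Index v4.1, output tokens per task, approx. cost per task in USD)
MODELS = {
    "GLM-5.2": (51, 43_000, 0.46),
    "GLM-5.1": (40, 26_000, 0.25),
}


def output_only_cost(output_tokens: int, price_per_m: float = OUTPUT_PRICE_PER_M) -> float:
    """Cost of the output tokens alone; input and cache-hit tokens are ignored."""
    return output_tokens * price_per_m / 1_000_000


def points_per_dollar(index_score: float, cost_per_task: float) -> float:
    """Illustrative ratio (hypothetical metric): index points per dollar of task cost."""
    return index_score / cost_per_task


for name, (score, out_tokens, cost) in MODELS.items():
    print(
        f"{name}: output-only ${output_only_cost(out_tokens):.3f} per task, "
        f"{points_per_dollar(score, cost):.0f} index points per dollar of task cost"
    )
```

On these numbers the per-token price is flat, but the per-task cost rises from about $0.25 to about $0.46, and the illustrative ratio falls from 160 to roughly 111. The extra points come from extra tokens.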
Scientific reasoning jumped hardest: CritPt +16 points to 21%, HLE +12 to 40%, TerminalBench v2.1 +16 to 78%, GPQA Diamond +3 to 89%. Artificial Analysis also notes a lower hallucination rate (28.1% vs 29.4%) on their Omniscience Index. That last delta is small, but the direction is consistent: this is not a rebranding exercise.
The thread, honestly
The top comment on HN is not “wow, China wins.” It is Tiberium complaining that GLM-5.2 xhigh spent fifteen minutes and ~45k tokens before writing the first line of a simple Nim math-evaluator library - a task that should land in a few hundred lines. Artificial Analysis’s own charts show GLM-5.2 as one of the least token-efficient open models at its intelligence tier: 43k output tokens per benchmark task, 37k of that reasoning. GPT-5.5 xhigh, by comparison, spends around 16k.
Raw intelligence went up. Thinking discipline did not keep pace.
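The token figures quoted above reduce to two ratios. This is a minimal sketch using only those numbers; the helper names are hypothetical, and the GPT-5.5 figure is the "around 16k" quoted above, so treat the result as approximate.

```python
# Output tokens per benchmark task, as quoted in the article.
GLM_52_OUTPUT = 43_000
GLM_52_REASONING = 37_000
GPT_55_XHIGH_OUTPUT = 16_000  # "around 16k", so this is an approximation


def reasoning_share(reasoning_tokens: int, output_tokens: int) -> float:
    """Fraction of a model's output tokens that were spent on reasoning."""
    return reasoning_tokens / output_tokens


def output_ratio(tokens_a: int, tokens_b: int) -> float:
    """How many times more output tokens model A spends than model B."""
    return tokens_a / tokens_b


share = reasoning_share(GLM_52_REASONING, GLM_52_OUTPUT)
ratio = output_ratio(GLM_52_OUTPUT, GPT_55_XHIGH_OUTPUT)

print(f"GLM-5.2 reasoning share of output: {share:.0%}")
print(f"GLM-5.2 vs GPT-5.5 xhigh output tokens: {ratio:.1f}x")
```

Roughly 86% of GLM-5.2's output is reasoning, and it spends about 2.7 times the output tokens of GPT-5.5 xhigh per task.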
That tension runs through the whole thread. alansaber says open-weights models still feel lackluster in multi-turn agent mode: less RL and a worse product shape, while frontier labs have been optimizing for agentic loops for years. mrngld points at Artificial Analysis’s coding-agent charts and argues GLM-5.1 xhigh was already twice the cost of GPT-5.5 medium for half the intelligence; bridging that gap with 5.2 is a big ask.
Then there is the counter-current. unrvl22 asks why more people are not talking about resellers offering unlimited tokens for $50/month, or API rates 3x below Z.ai’s official pricing - which is already ~10x cheaper than Opus-class models. For a lot of HN readers, the official API price is a fiction. The real market is Crof, Umans, and whatever Discord bot is hosting the weights this week.
kristopolous shared a daily script that ranks models by coding index from Artificial Analysis’s JSON feed. GLM-5.2 had been live for hours and was already climbing his personal leaderboard. Benchmarks are not gospel, but when your daily driver script moves, you notice.
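The ranking step of a script like that can be sketched in a few lines of Python. This is not kristopolous's script, and it does not touch Artificial Analysis's real feed: the field names `name` and `intelligence_index` are hypothetical. The sample rows reuse the Intelligence Index v4.1 values from the table above, because the coding-index values are not given here.

```python
from typing import Iterable


def rank_models(rows: Iterable[dict], key: str) -> list[tuple[int, str, float]]:
    """Return (rank, name, score) tuples sorted by descending score.

    Rows that lack the key, or have it set to None, are skipped.
    Ties keep their input order because Python's sort is stable.
    """
    scored = [(row["name"], row[key]) for row in rows if row.get(key) is not None]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(scored, start=1)]


# Hypothetical in-memory rows; a real script would load these from a JSON feed.
rows = [
    {"name": "GLM-5.1", "intelligence_index": 40},
    {"name": "MiniMax-M3", "intelligence_index": 44},
    {"name": "GLM-5.2", "intelligence_index": 51},
    {"name": "DeepSeek V4 Pro (max)", "intelligence_index": 44},
]

ranked = rank_models(rows, "intelligence_index")
for rank, name, score in ranked:
    print(f"{rank}. {name}: {score}")
```

Swapping in a different key, such as a coding index, only changes the second argument.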
Should you care?
| Read the original if… | Skip if… |
|---|---|
| You pick open-weights models for production and need a current leaderboard snapshot | You already run DeepSeek or Kimi and are happy |
| You care about agentic GDPval-style benchmarks, not just HumanEval | You only trust your own evals and ignore third-party indices |
| You want MIT-licensed weights with 1M context at sub-frontier pricing | Token efficiency matters more than peak score - GLM-5.2 thinks long |
| You buy through resellers and want to know what the underlying model can do | You think benchmark gaming makes all leaderboard posts noise |
My read: GLM-5.2 is the most credible “open weights is catching up” story since DeepSeek R1 shocked everyone in January. It is also a model that solves problems by throwing reasoning tokens at them until the benchmark surrenders. That works on Artificial Analysis’s scoring. It may not work on your latency budget.
The crown moved east again. The king just takes a very long time to put it on.
Discussion on Hacker News · Source: artificialanalysis.ai · Submitted by himata4113
Hoang Yell
A software developer and technical storyteller. I read Hacker News every day and retell the best stories here — in English and Vietnamese — for curious people who don't have time to scroll.