GLM-5.2 just took the open-weights crown - and it thinks a lot

I have a folder on my laptop called model-sweep-june. It is full of half-finished eval scripts, API keys I forgot to revoke, and terminal logs where some model spent forty thousand tokens deciding whether a Nim function should be called eval or evaluate. I opened it again this morning because HN was yelling about GLM-5.2, and I wanted to see if the hype matched my receipts.

It mostly does. With a caveat I will get to.

The story in one sentence

Z.ai’s GLM-5.2 is now the top-scoring open-weights model on Artificial Analysis’s Intelligence Index v4.1 at 51 points, up eleven from GLM-5.1 at the same 744B-total / 40B-active parameter size, MIT-licensed, with a 1M-token context window and API pricing that still undercuts the frontier labs by an embarrassing margin.

That is the headline. The subtext is that open-weights AI is no longer chasing the proprietary frontier from three counties away. It is sitting at the same table, ordering the same agentic benchmarks, and occasionally leaving with a higher GDPval score than GPT-5.5.

Why this hit the front page

Timing, mostly. Artificial Analysis had just shipped Intelligence Index v4.1 the day before - a benchmark refresh that tilts harder toward agentic workloads, longer trajectories, and real-world knowledge tasks. GLM-5.2 dropped into that new scoreboard and immediately topped the open-weights column.

But HN did not upvote a changelog. It upvoted a shift in the power map.

For years the open-weights conversation was “good enough for fine-tuning” or “great if you self-host on a rack you do not own.” GLM-5.2 is being sold as something else: a model that scores 1524 on GDPval-AA v2, essentially tied with GPT-5.5 (xhigh, 1514), and ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro max (1328). GDPval-AA v2 is Artificial Analysis’s agentic real-world work benchmark - not trivia, not single-turn coding puzzles.

The numbers that matter if you actually run models:

Metric	GLM-5.2	GLM-5.1	MiniMax-M3	DeepSeek V4 Pro (max)
Intelligence Index v4.1	51	40	44	44
GDPval-AA v2	1524	-	1418	1328
Output tokens per task	43k	26k	24k	37k
Cost per Intelligence Index task	~$0.46	~$0.25	~$0.18	~$0.05
Context window	1M	200K	-	-
License	MIT	-	-	-

Official API pricing stayed flat at $1.40 / $4.40 / $0.26 per million input / output / cache-hit tokens. Same rate card as 5.1, more brain per dollar - at least on paper, since the per-task cost in the table still rose with the longer outputs.
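As a rough check on what that rate card means per task, the output side alone can be worked out from the table's token counts. This is a back-of-envelope sketch, not Artificial Analysis's costing method: it ignores input and cache-hit tokens, whose per-task counts are not given here, so it only puts a floor under the bill. The function name is made up for illustration.

```python
# Official output rate quoted above for both GLM-5.2 and GLM-5.1, in USD per million tokens.
OUTPUT_USD_PER_MILLION = 4.4


def output_cost_usd(output_tokens: int, usd_per_million: float = OUTPUT_USD_PER_MILLION) -> float:
    """Cost of the output tokens only; input and cache-hit tokens are ignored."""
    return output_tokens * usd_per_million / 1_000_000


# Output tokens per task from the table: 43k for GLM-5.2, 26k for GLM-5.1.
glm_52 = output_cost_usd(43_000)
glm_51 = output_cost_usd(26_000)

print(f"GLM-5.2 output-only cost per task: ${glm_52:.3f}")
print(f"GLM-5.1 output-only cost per task: ${glm_51:.3f}")
print(f"Ratio: {glm_52 / glm_51:.2f}x")
```

Same per-token rate, about 1.65x the output spend per task. That is the same direction the table's cost row moves (~$0.25 to ~$0.46), with the remainder presumably coming from input tokens this sketch leaves out.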

Scientific reasoning and terminal tasks jumped hardest: CritPt +16 points to 21%, HLE +12 to 40%, TerminalBench v2.1 +16 to 78%, GPQA Diamond +3 to 89%. Artificial Analysis also notes a lower hallucination rate (28.1% vs 29.4%) on its Omniscience Index. Some of these deltas are small, but the direction is consistent: this is not a rebranding exercise.

The thread, honestly

The top comment on HN is not “wow, China wins.” It is Tiberium complaining that GLM-5.2 xhigh spent fifteen minutes and ~45k tokens before writing the first line of a simple Nim math-evaluator library - a task that should land in a few hundred lines. Artificial Analysis’s own charts show GLM-5.2 as one of the least token-efficient open models at its intelligence tier: 43k output tokens per benchmark task, 37k of that reasoning. GPT-5.5 xhigh, by comparison, spends around 16k.
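To put those token counts in proportion, here is a small sketch of the arithmetic. It uses only the figures quoted in the paragraph above and assumes the "around 16k" for GPT-5.5 xhigh is measured the same way as GLM-5.2's 43k (output tokens per benchmark task), which the text does not spell out.

```python
# Figures quoted in the paragraph above (tokens per benchmark task).
GLM_52_OUTPUT_TOKENS = 43_000
GLM_52_REASONING_TOKENS = 37_000
GPT_55_XHIGH_TOKENS = 16_000  # "around 16k", so treat this as approximate

# Share of GLM-5.2's output that is reasoning rather than the final answer.
reasoning_share = GLM_52_REASONING_TOKENS / GLM_52_OUTPUT_TOKENS

# Tokens left over for the actual answer.
answer_tokens = GLM_52_OUTPUT_TOKENS - GLM_52_REASONING_TOKENS

# How many times more tokens GLM-5.2 spends than GPT-5.5 xhigh (assumed comparable).
ratio_vs_gpt = GLM_52_OUTPUT_TOKENS / GPT_55_XHIGH_TOKENS

print(f"Reasoning share of output: {reasoning_share:.0%}")
print(f"Non-reasoning tokens per task: {answer_tokens}")
print(f"GLM-5.2 vs GPT-5.5 xhigh: {ratio_vs_gpt:.1f}x")
```

Roughly 86% of what GLM-5.2 emits per task is reasoning, and the total is about 2.7x what GPT-5.5 xhigh spends, if the two figures are comparable.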

Raw intelligence went up. Thinking discipline did not keep pace.

That tension runs through the whole thread. alansaber says open-weights models still feel lackluster in multi-turn agent mode - less RL, worse product shape - while frontier labs have been optimizing for agentic loops for years. mrngld points at Artificial Analysis’s coding-agent charts and argues GLM-5.1 xhigh was already twice the cost of GPT-5.5 medium for half the intelligence; bridging that gap with 5.2 is a big ask.

Then there is the counter-current. unrvl22 asks why more people are not talking about resellers offering unlimited tokens for $50/month, or API rates 3x below Z.ai’s official pricing - which is already ~10x cheaper than Opus-class models. For a lot of HN readers, the official API price is a fiction. The real market is Crof, Umans, and whatever Discord bot is hosting the weights this week.

kristopolous shared a daily script that ranks models by coding index from Artificial Analysis’s JSON feed. GLM-5.2 had been live for hours and was already climbing his personal leaderboard. Benchmarks are not gospel, but when your daily driver script moves, you notice.
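A ranking script of that kind can be very small. The sketch below is not kristopolous's script and does not call Artificial Analysis's feed: the record layout and the field names "name" and "intelligence_index" are hypothetical stand-ins, and the scores are the Intelligence Index v4.1 values from the table earlier in this post rather than coding-index numbers.

```python
from typing import Any


def rank_models(records: list[dict[str, Any]], key: str) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted best-first, skipping records with no score."""
    scored = [(r["name"], r[key]) for r in records if r.get(key) is not None]
    # sorted() is stable, so models with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in records. Field names are hypothetical, not the real feed's schema;
# scores are the Intelligence Index v4.1 values quoted in this post.
records = [
    {"name": "GLM-5.1", "intelligence_index": 40},
    {"name": "MiniMax-M3", "intelligence_index": 44},
    {"name": "GLM-5.2", "intelligence_index": 51},
    {"name": "DeepSeek V4 Pro (max)", "intelligence_index": 44},
]

ranked = rank_models(records, "intelligence_index")
for position, (name, score) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score}")
```

Swapping the key for whatever the real feed calls its coding index, and loading the records from that feed instead of a literal list, would be the only changes needed under these assumptions.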

Should you care?

Read the original if…	Skip if…
You pick open-weights models for production and need a current leaderboard snapshot	You already run DeepSeek or Kimi and are happy
You care about agentic GDPval-style benchmarks, not just HumanEval	You only trust your own evals and ignore third-party indices
You want MIT-licensed weights with 1M context at sub-frontier pricing	Token efficiency matters more than peak score - GLM-5.2 thinks long
You buy through resellers and want to know what the underlying model can do	You think benchmark gaming makes all leaderboard posts noise

My read: GLM-5.2 is the most credible “open weights is catching up” story since DeepSeek R1 shocked everyone in January. It is also a model that solves problems by throwing reasoning tokens at them until the benchmark surrenders. That works on Artificial Analysis’s scoring. It may not work on your latency budget.

The crown moved east again. The king just takes a very long time to put it on.

Discussion on Hacker News · Source: artificialanalysis.ai · Submitted by himata4113

Hoang Yell
