The Wildly Simple Architecture Running Our AI Moment

5 min read

A while ago a colleague showed me the king − man + woman ≈ queen vector arithmetic trick. I’d seen it before, but watching him type it into a REPL like it was a party trick made something click: the model has no idea what a king is. It just learned that those four word-shapes tend to live in a particular geometric relationship. That’s not intelligence. That’s pattern-matching at cosmic scale.
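
To make the trick concrete, here is a toy sketch with hand-made two-dimensional vectors. Everything in it is an assumption for illustration: the numbers are invented (real embeddings are learned and have thousands of dimensions), and the `cosine`, `nearest`, and `analogy` helpers are hypothetical names, not part of any library.

```python
import math

# Toy 2-D "embeddings": axis 0 is roughly "royalty", axis 1 roughly "gender".
# These values are made up for illustration; real models learn them.
EMBEDDINGS = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude):
    """Word whose embedding points most nearly the same way as `vector`."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

def analogy(a, b, c):
    """Return the word closest to a - b + c, ignoring the three inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude={a, b, c})

result = analogy("king", "man", "woman")
print(result)
```

Nothing in the code knows what a king is. It subtracts, adds, and compares angles.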

Which made me want to understand the actual machinery. Not the vibes - the machinery.

0xkato’s “How LLMs Actually Work” is the post I wish I’d had three years ago. It’s a 26-minute, ground-up walkthrough of the transformer architecture that powers every major LLM right now - without the sticky differential equations. By the end, you can read a model card and know exactly which layer each specification line is talking about.

What the post actually covers

The path is clean:

  1. Tokenization - text becomes integers; “tokenization” splits into ["token", "ization"]
  2. Embeddings - those integers get looked up in a giant table; each token becomes a 4,096-float vector in a 7B model
  3. Positional encoding - specifically RoPE, which rotates vectors by angle-per-position instead of adding a position blob, now used in LLaMA, Mistral, Gemma, Qwen
  4. Attention - Query/Key/Value: each token asks “what am I looking for?” and receives a weighted mix of other tokens’ Values
  5. Multi-head attention - 32 heads in parallel in a typical 7B model, each learning different relationship types: grammar, pronouns, long-range references
  6. Feed-forward networks - the unglamorous step where most parameters actually live: expand → nonlinearity → compress, run per token independently
  7. Residual stream + normalization - the “additive” trick borrowed from ResNet that makes 100-layer networks trainable
  8. Next-token prediction - one softmax over the whole vocabulary, then the loop repeats
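
Steps 4 and 8 lean on the same two primitives, a dot product and a softmax. The sketch below is a minimal single-head version in plain Python, under loud assumptions: the function names are hypothetical, the vectors are toy values, and the same vectors are reused as Query, Key, and Value (a real model derives each from learned projections). The causal mask, where each token only looks at itself and earlier tokens, is how decoder-only models work but is not spelled out in the list above.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Token i may only attend to positions 0..i (no peeking ahead).
        scores = [dot(query, keys[j]) / math.sqrt(dim) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the weighted mix of the visible tokens' Value vectors.
        mixed = [sum(w * values[j][k] for j, w in enumerate(weights))
                 for k in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Three toy token vectors, reused as Q, K and V (an assumption for brevity).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(tokens, tokens, tokens)

# Step 8 in miniature: made-up logits over a 3-word vocabulary become
# probabilities, and greedy decoding picks the largest one.
probs = softmax([2.0, 1.0, 0.1])
next_token_id = probs.index(max(probs))
print(attended[0], next_token_id)
```

The first token can only attend to itself, so its output is its own Value vector unchanged. Every later token gets a blend.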

One of the sharpest details in the post: with Grouped-Query Attention, LLaMA-2 70B runs 64 query heads but only 8 key/value heads. Nearly the same quality at one-eighth of the KV-cache memory. That’s why long contexts used to be prohibitively expensive and now they’re merely expensive.
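
A back-of-the-envelope sketch shows where the saving comes from. Only the head counts (64 and 8) come from the text above. The layer count of 80, the head dimension of 128, and 2-byte (fp16) cache entries are assumptions taken from the published LLaMA-2 70B configuration, and `kv_cache_bytes_per_token` is a hypothetical helper.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: every layer stores one Key and one Value per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

N_LAYERS = 80   # assumption: LLaMA-2 70B published config
HEAD_DIM = 128  # assumption: 8192 hidden size / 64 query heads

full_mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=64, head_dim=HEAD_DIM)
grouped = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

print(full_mha // 1024, "KiB per token with 64 KV heads")
print(grouped // 1024, "KiB per token with 8 KV heads")
print(full_mha // grouped, "x smaller cache")
```

Under those assumptions a 4,096-token context needs roughly 10 GiB of cache with one KV head per query head, and about 1.25 GiB with grouping.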

Another one: Mixtral 8x7B has 46.7 billion total parameters but routes each token through only about 12.9 billion of them. Mixture-of-experts lets you grow parameter count without growing per-token compute at the same rate.
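
Those two numbers are enough to estimate how the parameters split, as a rough sketch. It assumes each token passes through all shared weights (attention, embeddings) plus 2 of 8 equal-sized experts. The top-2 routing is Mixtral's published design rather than something stated above, and "expert" here means one expert's feed-forward weights summed over all layers.

```python
TOTAL_B = 46.7    # total parameters, in billions (from the text)
ACTIVE_B = 12.9   # parameters touched per token, in billions (from the text)
N_EXPERTS = 8
TOP_K = 2         # assumption: each token is routed to 2 of the 8 experts

# Two equations, two unknowns:
#   total  = shared + N_EXPERTS * expert
#   active = shared + TOP_K     * expert
expert_b = (TOTAL_B - ACTIVE_B) / (N_EXPERTS - TOP_K)
shared_b = ACTIVE_B - TOP_K * expert_b

print(f"per-expert weights: ~{expert_b:.1f}B")
print(f"shared weights:     ~{shared_b:.1f}B")
print(f"active fraction:    {ACTIVE_B / TOTAL_B:.0%}")
```

So each token pays for the compute of roughly a 13B model, while all 46.7B parameters still have to sit in memory.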

Why this hit 252 points

There are a thousand “how transformers work” posts. Most either stay too surface-level or disappear into matrix multiplication notation before page two. This one hits the Goldilocks zone: it uses plain prose with tight little “Tiny explainer” boxes wherever a concept needs grounding, covers nine topics in sequence without dropping a thread, and arrives at architecture-versus-weights as a real insight rather than a tacked-on conclusion.

Timing helps too. Everyone is using LLMs, half of HN works on or adjacent to AI systems, and a significant fraction are tired of treating these models like oracles. A post that demystifies the machinery lands differently in 2026 than it would have in 2020.

What HN is actually arguing about

The top comment from malwrar is worth reading alongside the article:

“The autoregressive decoder-only transformer LLM architecture as pioneered by OpenAI is wildly simple for how revolutionary its results are. […] The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit REALLY BIG, dimensionally speaking.”

And then: “I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.”

That’s the thread’s real energy. Not “this post is wrong” - it’s “this post is correct, which is the most unsettling part.”

Another commenter drew an analogy to learning TCP/IP by watching raw packets over 1200-baud packet radio. The argument is the same: watch the output of a slow LLM token by token and eventually you start seeing the machinery. The mechanism is understandable. You can’t predict the tokens but you can see how they form.

Should you read the original?

| Read if… | Skip if… |
| --- | --- |
| You work with LLMs and want to know what GQA, RoPE, and SwiGLU mean | You already absorbed Karpathy’s “Let’s build GPT” and Anthropic’s circuits work |
| You want to read model papers without googling every second line | You need the math - this deliberately skips the equations |
| You’re explaining LLMs to someone technical without an ML background | You need implementation details - this is conceptual architecture, not code |

The post closes with a note that the transformer absorbed most of the field - vision, audio, language, multimodal - and that this could change. State-space models such as Mamba and hybrid architectures are credible alternatives. But the core loop of tokens-embeddings-attention-FFN-prediction is durable enough that even future architectures will probably solve these same sub-problems in recognizable form.

What I take away is the same thing malwrar did, just slightly differently: the simplicity isn’t a disappointment. It’s the plot twist. We built something that resembles understanding out of a lookup table, a dot product, and a softmax. We just made it very, very big.


Discussion on Hacker News · Source: 0xkato.xyz · Submitted by 0xkato

Hoang Yell

A software developer and technical storyteller. I read Hacker News every day and retell the best stories here — in English and Vietnamese — for curious people who don't have time to scroll.