Large Language Models, Playable

EP 01

Models don't read words — they read tokens

Before a model sees your text, it gets chopped into tokens: frequent chunks that may be whole words, pieces of words, or punctuation. Common words survive whole; rare words get split into pieces.
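A minimal sketch of the splitting idea, assuming a hand-picked toy vocabulary. `VOCAB` and `tokenize` are hypothetical names, not any library's API. At each position it greedily takes the longest known piece. This is closer to WordPiece-style longest match than to BPE, but it shows why a common word stays whole while a rare one falls apart.

```python
# Hypothetical toy vocabulary: a few whole words plus some subword pieces.
VOCAB = {"the", "cat", "sat", "un", "believ", "able", "token", "s"}


def tokenize(word: str, vocab: set[str] = VOCAB) -> list[str]:
    """Split one word by repeatedly taking the longest vocabulary entry
    that matches at the current position (greedy longest match)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("cat"))           # ['cat']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokens"))        # ['token', 's']
```

With this vocabulary, "cat" costs one token and "unbelievable" costs three. A real tokenizer's counts depend entirely on the vocabulary it learned.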

The demo uses a simplified subword tokenizer for illustration. Real tokenizers, such as those based on byte-pair encoding (BPE), learn their chunks from data, but they split text the same way in spirit.
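Here is a sketch of how BPE "learns its chunks from data": start from single characters and repeatedly fuse the most frequent adjacent pair. It is simplified. Production tokenizers work on bytes, pre-split text into words, and apply their own tie-breaking rules. The word counts below are made up for illustration.

```python
from collections import Counter


def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn byte-pair-encoding merges: start from single characters and
    repeatedly fuse the most frequent adjacent pair of symbols."""
    # Each distinct word becomes a tuple of symbols, weighted by its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges


# Made-up word counts, chosen so each step has a clear winner.
words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
print(learn_bpe(words, 3))  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges, the frequent word "hug" is already a single chunk, while rarer strings still need several pieces.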

Why it matters: models are priced per token, their context limits are counted in tokens, and some of their mistakes trace back to how the text was split. If “unbelievable” costs 3 tokens while “cat” costs 1, long rare words eat your context window faster.

EP 02

Watch it guess — a tiny LLM living in this page

This is a real (tiny!) language model, trained right here in your browser on a few paragraphs of text. At every step it looks at the most recent words, computes a probability for each possible next word, and samples one. The bars show its actual internal probabilities. GPT-style models do the same thing, except they choose among 100,000+ tokens instead of a handful of words.
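The "look, score, sample" loop can be sketched with the simplest possible model, a bigram counter that looks only at the previous word. This is an assumption: the page's own model may look further back. The training text and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training text standing in for the page's "few paragraphs".
TEXT = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def train_bigram(text: str) -> dict:
    """For every word, count which words followed it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def next_word_probs(model: dict, prev: str) -> dict:
    """Turn follow-counts into a probability for each candidate next word."""
    counts = model.get(prev)
    if not counts:
        return {}  # never saw this word followed by anything
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def sample(probs: dict, rng: random.Random) -> str:
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


model = train_bigram(TEXT)
rng = random.Random(0)
out = ["the"]
for _ in range(10):
    probs = next_word_probs(model, out[-1])
    if not probs:
        break
    out.append(sample(probs, rng))
print(" ".join(out))
```

After "the", this model gives "cat" probability 1/3 and each of "mat", "fish", "dog" and "rug" 1/6. Those numbers are exactly what the bars would show.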


At temperature 0.8 the model is balanced: it mostly picks likely words but occasionally surprises.

Turn the knob: at temperature 0 it always takes the top bar (deterministic, repetitive). Near 2 it flattens the odds so much that bad options compete with good ones (creative at first, then incoherent). That one slider accounts for much of the difference between a boring assistant and an erratic one.
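A common way to implement the temperature knob is to divide the model's raw scores (logits) by T before the softmax. The sketch below uses made-up scores. Treating T = 0 as "always take the top word" is a convention, since dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Turn raw scores into probabilities. Dividing by a small temperature
    sharpens the distribution; a large one flattens it. temperature == 0 is
    treated as greedy: all probability goes to the top-scoring word."""
    if temperature == 0:
        best = max(logits, key=logits.get)
        return {w: (1.0 if w == best else 0.0) for w in logits}
    scaled = {w: s / temperature for w, s in logits.items()}
    top = max(scaled.values())  # subtract the max to keep exp() from overflowing
    exps = {w: math.exp(s - top) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


logits = {"mat": 2.0, "rug": 1.0, "moon": -1.0}  # hypothetical raw scores
for t in (0, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {w: round(p, 2) for w, p in probs.items()})
```

Raising T from 0.8 to 2.0 lowers the top word's share and raises the share of the unlikely "moon". This is the "bad options compete with good ones" effect.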

EP 03

Why context is everything

Same machine, one change: how many recent words it is allowed to see before guessing. Its predictions sharpen as you give it more memory. This is the intuition behind “context windows”, and it is why models with amnesia ramble.


The pattern: blind → grammar soup. One word → plausible phrases. Two words → it locks onto the sentence. Real LLMs stretch this from 2 words to hundreds of thousands of tokens; that leap, plus scale, is a large part of the revolution.
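The "how many words can it see" slider can be sketched as an n-gram counter with a context length k, using a made-up training text. The names `train` and `predict` are hypothetical. With k = 0 the model knows only overall word frequencies. With k = 1 it conditions on the previous word, and with k = 2 on the previous two.

```python
from collections import Counter, defaultdict

# Hypothetical training text.
TEXT = "the cat sat on the mat . the dog sat on the rug . the cat ate the fish ."


def train(words: list[str], k: int) -> dict:
    """Map each k-word context to counts of the word that came next.
    k = 0 means no context at all: plain word frequencies."""
    table = defaultdict(Counter)
    for i in range(k, len(words)):
        table[tuple(words[i - k:i])][words[i]] += 1
    return table


def predict(table: dict, context: list[str], k: int) -> dict:
    """Next-word probabilities given only the last k words of the context."""
    key = tuple(context[-k:]) if k else ()
    counts = table.get(key)
    if not counts:
        return {}
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common()}


words = TEXT.split()
context = "the cat ate the".split()
for k in (0, 1, 2):
    probs = predict(train(words, k), context, k)
    print(k, {w: round(p, 2) for w, p in list(probs.items())[:3]})
```

Blind (k = 0), the best guess is just the most common word, "the". With one word it knows a noun comes next but not which one. With two words ("ate the"), it locks onto "fish".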

EP 04

Fluency is not truth — make it hallucinate

The model can only remix what it was trained on, yet by default it never says “I don't know”. Ask it about something inside its tiny world, then about something it has never seen: its confidence stays high either way.


The scary part: the grammar is just as good in both runs. Fluency comes from word statistics; truth would require knowledge the model doesn't have. That gap, confident text with nothing behind it, is the core of what a hallucination is in GPT-scale models too.
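Here is a sketch of why "I don't know" never shows up on its own. It assumes a bigram model that falls back ("backs off") to overall word frequencies when it meets an unseen word. Backoff is a common smoothing trick, though not necessarily what the page's demo does. The training text is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training text: the model's entire "world".
TEXT = "the cat sat on the mat . the cat ate the fish ."

words = TEXT.split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1
unigrams = Counter(words)


def next_word_probs(prev: str) -> dict:
    """Next-word distribution after `prev`. For a word never seen in training,
    back off to overall word frequencies instead of refusing to answer."""
    counts = bigrams.get(prev) or unigrams
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


for prompt in ("cat", "quantum"):  # one familiar word, one never seen
    probs = next_word_probs(prompt)
    best = max(probs, key=probs.get)
    print(f"{prompt!r} -> {best!r} (p={probs[best]:.2f}, total={sum(probs.values()):.2f})")
```

The exact numbers matter less than the shape. Both calls return a complete distribution that sums to 1 over words the model knows, and nothing in the vocabulary stands for "I haven't seen this".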

EP 05

The doom loop — why greedy decoding repeats itself

Set temperature to zero and the model always picks the single most likely word. That sounds sensible. But this tiny model's state is just its last few words, drawn from a finite vocabulary, so it has only finitely many states. A deterministic path through them must eventually revisit one, and from there it loops forever. Run it and catch the loop; then let a little randomness break the spell.
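Here is a sketch of the doom loop with a bigram model, whose whole state is the previous word. This is a simplifying assumption: real LLMs condition on far more, though the same pigeonhole argument applies in principle whenever the state space is finite. Greedy decoding always follows the top word, so the first time a word repeats, the output is locked into a cycle. The training text and function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training text.
TEXT = "the cat sat on the mat . the cat sat on the rug . the dog ate the fish ."


def greedy_generate(text: str, start: str, max_steps: int = 30):
    """Greedy (temperature 0) decoding with a bigram model. The model's whole
    state is the previous word, so the first repeated word proves a loop.
    Returns the words generated and (loop_start, loop_end) or None."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

    out = [start]
    first_seen = {start: 0}  # word -> position where it first appeared
    for _ in range(max_steps):
        counts = follows.get(out[-1])
        if not counts:
            return out, None  # dead end: nothing ever followed this word
        nxt = counts.most_common(1)[0][0]  # always take the top word
        out.append(nxt)
        if nxt in first_seen:
            # Same state as before + deterministic rule = same future forever.
            return out, (first_seen[nxt], len(out) - 1)
        first_seen[nxt] = len(out) - 1
    return out, None


out, loop = greedy_generate(TEXT, "the")
print(" ".join(out))                                # the cat sat on the
print("loops on:", " ".join(out[loop[0]:loop[1]]))  # loops on: the cat sat on
```

Starting from "dog" only delays the trap: the output reads "dog ate the cat sat on the", and from there "the cat sat on" repeats forever.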


Why it matters: this is the failure that repetition penalties and sampling are meant to counter. Greedy decoding suits tasks with one right answer (math, code fixes); for open-ended text it is a trap. Next time a chatbot repeats itself, you have a good idea of what happened inside.
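Here is a sketch of one common repetition-penalty formulation, the rule behind the `repetition_penalty` setting in libraries such as Hugging Face Transformers. For every word that already appeared, divide its score if positive and multiply it if negative. The function name, the scores, and the 1.3 default are illustrative assumptions.

```python
def apply_repetition_penalty(logits: dict, generated: list, penalty: float = 1.3) -> dict:
    """Lower the scores of words that already appeared in the output.
    Positive scores are divided by the penalty and negative ones multiplied
    by it, so a repeated word never becomes more likely."""
    out = dict(logits)
    for w in set(generated):
        if w in out:
            s = out[w]
            out[w] = s / penalty if s > 0 else s * penalty
    return out


logits = {"the": 2.0, "cat": 1.8, "dog": 1.7}  # hypothetical raw scores
history = ["the", "cat", "sat", "on", "the"]
penalized = apply_repetition_penalty(logits, history)
print(max(logits, key=logits.get))        # the
print(max(penalized, key=penalized.get))  # dog
```

The sign-dependent rule matters. Dividing a negative score by the penalty would push it toward zero and make the repeated word more likely, the opposite of the intent.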