Iteration and Tokens
Iteration in neural networks happens at two very different scales: the micro‑level of token generation and the macro‑level of training. Understanding both clarifies how models move from input to output and from random weights to accurate predictions.
Iteration during inference (generating one token at a time)
When a model like GPT‑4 responds to you, it does not produce the entire answer in one giant calculation. It works in a loop:
Step 1 – The model receives the input prompt "The capital of France is" and processes it through its layers, using the query, key, value mechanism and the arrays we discussed. It outputs a probability distribution over every possible next token: "Paris" might have the highest probability, followed by "Lyon", "France", and so on.
Step 2 – The model selects one token (usually the most probable, or samples if we want randomness). It appends that token to the input sequence, which now becomes "The capital of France is Paris".
Step 3 – It repeats the entire process, but with a crucial efficiency trick: it does not recompute keys and values for the earlier tokens. Instead, it retrieves them from the KV cache stored in memory. It computes attention only for the newest token ("Paris") against the cached keys and values of all previous tokens. This iteration continues, token by token, until the model predicts a special stop token or reaches a length limit.
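The three steps above can be sketched as a loop in plain Python. This is a minimal sketch, not a real transformer: toy_logits, the placeholder cache entries, and the eight-word vocabulary are invented stand-ins for a genuine forward pass, but the shape of the loop (prefill the prompt once, then generate one token per iteration while the cache grows) matches the description above.

```python
import random

VOCAB = ["The", "capital", "of", "France", "is", "Paris", ".", "<eos>"]

def toy_logits(token_id, kv_cache):
    # Stand-in for one forward pass over a single position. A real model
    # would attend from this token's query to every cached key/value pair.
    random.seed(token_id * 1000 + len(kv_cache))
    return [random.random() for _ in VOCAB]

def generate(prompt_ids, max_new_tokens=5):
    kv_cache = []            # grows by one (key, value) entry per processed token
    seq = list(prompt_ids)
    # Prefill: process the prompt once, caching each position (Step 1).
    for t in seq:
        logits = toy_logits(t, kv_cache)
        kv_cache.append(("k%d" % t, "v%d" % t))   # placeholder key/value pair
    # Decode loop: one new token per iteration (Steps 2 and 3).
    for _ in range(max_new_tokens):
        next_id = max(range(len(VOCAB)), key=lambda i: logits[i])  # greedy pick
        if VOCAB[next_id] == "<eos>":             # special stop token
            break
        seq.append(next_id)                       # append to the sequence
        logits = toy_logits(next_id, kv_cache)    # only the newest token is computed
        kv_cache.append(("k%d" % next_id, "v%d" % next_id))
    return [VOCAB[i] for i in seq]

print(" ".join(generate([0, 1, 2, 3, 4])))
```

Greedy selection (always taking the argmax) is just one choice in Step 2; sampling from the distribution would trade determinism for variety.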
Why this iterative generation matters
This loop explains why response time grows with output length: each new token adds another pass through the network. It also explains why models sometimes lose coherence in very long generations: errors can accumulate with each iteration. The KV cache is what makes this loop practical; without it, generating a 1000‑token response would require processing the entire growing sequence from scratch 1000 times, which would be prohibitively slow.
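The saving is easy to put numbers on. This back-of-the-envelope count tallies token positions pushed through the network (the 20-token prompt length is an assumed example, and it deliberately ignores the attention computation's own growth with sequence length):

```python
prompt_len, new_tokens = 20, 1000

# Without a KV cache, step i re-runs the forward pass over the whole
# sequence seen so far: prompt plus i already-generated tokens.
without_cache = sum(prompt_len + i for i in range(new_tokens))

# With a KV cache: one prefill pass over the prompt, then exactly one
# new position per generated token.
with_cache = prompt_len + new_tokens

print(without_cache, with_cache)  # 519500 vs 1020 positions processed
```

Roughly a 500-fold difference for this example, and the gap widens as the output grows.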
Iteration during training (the weight update loop)
Before the model can generate anything sensible, it must learn. Training iteration is entirely different:
Step 1 – The model is given a batch of training examples (e.g., thousands of text snippets). It runs a forward pass to produce predictions.
Step 2 – It calculates the loss (the error) by comparing its predictions to the correct targets.
Step 3 – It performs a backward pass (backpropagation) to compute gradients—the direction each weight should move to reduce the error.
Step 4 – It updates all weights slightly in the direction that lowers the loss. This is one training step.
Step 5 – The process repeats with the next batch, sometimes millions of times, slowly sculpting the weights into a configuration that generalises well.
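The five steps above can be shown end to end on the smallest possible model: one weight, one batch, pure Python. The dataset, learning rate, and step count here are made-up toy values; the point is the shape of the loop, which is the same one that, at vastly larger scale, sculpts a transformer's weight matrices.

```python
# Learn y = 2x by gradient descent: forward pass, loss, gradient, update.
data = [(x, 2.0 * x) for x in range(1, 6)]   # tiny "dataset", one batch
w = 0.0                                      # initial weight, deliberately wrong
lr = 0.01                                    # learning rate

for step in range(200):                      # Step 5: repeat over many steps
    # Step 1: forward pass over the batch
    preds = [w * x for x, _ in data]
    # Step 2: mean squared error loss against the correct targets
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    # Step 3: backward pass, i.e. the gradient of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    # Step 4: nudge the weight slightly against the gradient
    w -= lr * grad

print(round(w, 3))  # converges close to 2.0
```

Real training differs in scale (billions of weights, backpropagation through many layers, adaptive optimisers), but every one of those systems is still running this same five-step loop.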
How these two iterations connect
The training loop produces the weight matrices (W_q, W_k, W_v, and many others) that are later used during inference. The inference loop then uses those fixed weights to generate text token by token. Both are iterative, but training iterates over data batches to adjust weights, while inference iterates over token positions to build an output sequence. The KV cache bridges them: it is a dynamic structure created during inference that stores the keys and values computed using the trained weights.
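That bridge can be made concrete in a few lines. In this sketch the 2×2 matrices standing in for W_k and W_v, and the token embeddings, are made-up numbers; what matters is that the weights are frozen (the output of training) while the cache is rebuilt fresh for every generation request (a product of inference):

```python
# Frozen weights, fixed once training ends (toy 2x2 stand-ins for W_k, W_v).
W_k = [[1.0, 0.0], [0.5, 1.0]]
W_v = [[0.0, 1.0], [1.0, 0.5]]

def matvec(W, x):
    # Multiply a weight matrix by a token embedding vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# The cache is a dynamic structure, created empty for each new request.
kv_cache = {"keys": [], "values": []}

# Toy token embeddings arrive one at a time during inference; the trained
# weights turn each one into a key and a value that the cache retains.
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    kv_cache["keys"].append(matvec(W_k, x))
    kv_cache["values"].append(matvec(W_v, x))

print(kv_cache["keys"][0])  # [1.0, 0.5]
```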
Iteration is the engine of both learning and generation—a loop that turns static weights into dynamic understanding, and a separate loop that turns that understanding into coherent language.