How AI Understands Context

The Analogy

Imagine each new speaker can hear everyone who spoke earlier, not just the previous person. During generation, a causal mask keeps future speakers unheard until their turn.


Stage 1 -- The Problem

Before Attention: The Phone Chain

Memory of earlier words

The old way passes information left to right, one word at a time, and loses a little with every step. By the end of the sentence, the model barely remembers that "cat" is the subject of "ran," so long-distance relationships are lost.
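The phone-chain effect can be sketched with a toy calculation. This is only an illustration, not a real model: the retention factor of 0.7 and the example sentence are made-up assumptions, chosen to show how quickly a signal fades when it must be handed along step by step.

```python
# Toy model of the "phone chain": each hand-off keeps only a fraction
# of what was remembered. RETENTION is a made-up number for illustration.
RETENTION = 0.7


def memory_after(steps: int, retention: float = RETENTION) -> float:
    """Fraction of the original signal left after `steps` hand-offs."""
    return retention ** steps


# Hypothetical sentence; only "cat" and "ran" come from the text above.
sentence = ["The", "cat", "that", "sat", "on", "the", "mat", "ran"]
cat_pos = sentence.index("cat")
ran_pos = sentence.index("ran")

# Number of hand-offs between "cat" and "ran" in this sentence.
distance = ran_pos - cat_pos
remaining = memory_after(distance)
print(f"{distance} steps later, about {remaining:.0%} of 'cat' remains")
```

The longer the gap between two related words, the smaller the remaining fraction, which is the problem the next stage addresses.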

Stage 2 -- The Breakthrough

Attention: The Room

The goal of this stage is simple: the action word "ran" should reconnect directly to "cat." Watch the important words lift up, then follow the curved lines.

With attention, "ran" can connect directly to earlier context such as "cat." During generation, a causal mask hides future tokens, like covering the unread part of a sentence.
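A minimal sketch of attention with a causal mask, in plain Python. The query and key vectors below are hand-picked assumptions so that "ran" lines up with "cat"; in a trained model these numbers are learned, and the function name `causal_attention_weights` is hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (-inf becomes 0)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    Row i says how much token i attends to each token j.
    Any j > i is a future token and gets weight 0.
    """
    dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            if j > i:
                row.append(float("-inf"))  # future token: hidden
            else:
                dot = sum(a * b for a, b in zip(q, k))
                row.append(dot / math.sqrt(dim))
        weights.append(softmax(row))
    return weights


tokens = ["the", "cat", "ran"]
# Hand-picked toy vectors (assumptions, not learned values).
queries = [[0.1, 0.0], [0.0, 0.5], [2.0, 0.0]]
keys = [[0.0, 0.1], [2.0, 0.0], [0.0, 1.0]]

weights = causal_attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

In the printed rows, "the" can only see itself, "cat" cannot see "ran," and "ran" puts most of its weight on "cat" without passing through the words in between.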

Stage 3 -- Explore Attention

Click the Highlighted Word

This stage shows why the same word can mean different things. Click the purple word and follow the arrows to the strongest context clues.
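One way to picture this is a toy example where an ambiguous word blends in the meaning of whatever it attends to. Everything here is a made-up assumption for illustration: the word "bank," the two-number vectors (read as [nature, finance]), and the helper `meaning_in_context`. Real models use learned vectors with many more dimensions and separate learned projections for queries and keys.

```python
import math

# Hypothetical 2-number "meanings": [nature, finance]. Made up for illustration.
VECTORS = {
    "the": [0.1, 0.1],
    "river": [2.0, 0.0],
    "loan": [0.0, 2.0],
    "bank": [0.6, 0.6],  # ambiguous on its own
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def meaning_in_context(target, sentence):
    """Blend the sentence's vectors, weighted by attention from `target`."""
    query = VECTORS[target]
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, VECTORS[word])) / scale
        for word in sentence
    ]
    weights = softmax(scores)
    blended = [
        sum(wt * VECTORS[word][d] for wt, word in zip(weights, sentence))
        for d in range(len(query))
    ]
    return weights, blended


river_weights, river_meaning = meaning_in_context("bank", ["the", "river", "bank"])
loan_weights, loan_meaning = meaning_in_context("bank", ["the", "bank", "loan"])

print("near 'river':", [round(x, 2) for x in river_meaning])
print("near 'loan': ", [round(x, 2) for x in loan_meaning])
```

The same starting vector for "bank" ends up leaning toward nature in one sentence and toward finance in the other, because its strongest context clue is different each time.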


Stage 4 -- Multi-Head Attention

Three Heads, Three Perspectives

Multi-head attention means the model does not use only one spotlight. Different heads look for grammar, references, and meaning at the same time.

Click a head above to see what it pays attention to.

The heads run in parallel, and each one can specialize in a different kind of relationship, such as grammar, meaning, or references.
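A rough sketch of the idea, assuming the query and key vectors have already been projected and are simply sliced into equal parts, one per head (the learned projections are left out). The six-number vectors are hand-picked so that each head prefers a different word; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Cut one vector into equal slices, one per head."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_weights(query, keys, num_heads):
    """Attention weights per head: each head scores only its own slice."""
    q_parts = split_heads(query, num_heads)
    k_parts = [split_heads(k, num_heads) for k in keys]
    all_weights = []
    for h in range(num_heads):
        dim = len(q_parts[h])
        scores = [
            sum(a * b for a, b in zip(q_parts[h], parts[h])) / math.sqrt(dim)
            for parts in k_parts
        ]
        all_weights.append(softmax(scores))
    return all_weights


tokens = ["the", "cat", "ran"]
# Hand-picked toy vectors (assumptions): three heads, two numbers each.
query_for_ran = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
keys = [
    [0.0, 0.0, 0.0, 2.0, 0.0, 0.0],  # "the"
    [2.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # "cat"
    [0.0, 0.0, 0.0, 0.0, 2.0, 2.0],  # "ran"
]

heads = multi_head_weights(query_for_ran, keys, num_heads=3)
for h, row in enumerate(heads):
    focus = tokens[row.index(max(row))]
    print(f"head {h} focuses on '{focus}':", [round(w, 2) for w in row])
```

Each head sees the same words but produces its own set of weights, so one sentence yields several attention patterns at once.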