Every word looks at every other word at once. No information lost to distance.
Imagine a room where everyone can hear everyone else at the same time. Now imagine a phone chain where each person only hears the one before them. Attention is the room.
Use the arrows below, the dots above, or your keyboard arrow keys to move through the stages.
The old way: read left to right, one word at a time, losing a little information with every step. By the end, the model barely remembers that "cat" is the subject of "ran." Long-distance relationships fade.
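Here is a rough sketch of the phone chain. It is a toy, not a real recurrent model: the `keep` value and the function name `first_word_signal` are made up for illustration. Each step keeps only part of what came before, so the first word's signal shrinks as the sentence gets longer.

```python
def first_word_signal(num_words, keep=0.5):
    """Pass a signal down the chain and return how much of word 1 is left.

    `keep` is a hypothetical fraction of the old state that survives each step.
    """
    state = 0.0
    for position in range(num_words):
        # Only the first word carries the signal we want to track.
        new_input = 1.0 if position == 0 else 0.0
        # Each step blends the old state with the new word.
        state = keep * state + (1.0 - keep) * new_input
    return state


short_sentence = first_word_signal(3)
long_sentence = first_word_signal(12)
print(f"3 words:  {short_sentence:.6f}")
print(f"12 words: {long_sentence:.6f}")
```

The longer the chain, the less of the first word reaches the end.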
The goal of this stage is simple: the action word "ran" should reconnect directly to "cat." Watch the important words lift up, then follow the curved lines.
With attention, every word sees every other word. "Ran" looks directly at "cat" regardless of distance. No information lost.
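A minimal sketch of the room, assuming made-up three-number word vectors. Real models learn these numbers and also learn separate query, key, and value projections, which this sketch leaves out. Each word scores itself against every word, turns the scores into weights that sum to 1, and mixes the vectors by those weights.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Let every word look at every word, with no learned projections."""
    size = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(size)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all word vectors using the weights.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(size)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors: "cat" and "ran" are made similar on purpose.
words = ["the", "cat", "that", "sat", "ran"]
embeddings = [
    [0.1, 0.0, 0.1],  # the
    [2.0, 0.1, 0.0],  # cat
    [0.0, 0.1, 0.1],  # that
    [0.1, 1.0, 0.0],  # sat
    [2.0, 0.0, 0.1],  # ran
]

outputs, weights = self_attention(embeddings)
ran = words.index("ran")
others = [(w, words[i]) for i, w in enumerate(weights[ran]) if i != ran]
strongest = max(others)[1]
print(f'"ran" looks hardest at "{strongest}"')
```

The score between "ran" and "cat" does not depend on how many words sit between them.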
This stage shows why the same word can mean different things. Click the purple word and follow the arrows to the strongest context clues.
Multi-head attention means the model does not use only one spotlight. Different heads look for grammar, references, and meaning at the same time.
Multiple heads work simultaneously: grammar, meaning, references, all in parallel. Each head tends to specialize in a different type of relationship.
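A small sketch of two spotlights, under heavy assumptions. Here each head just gets its own slice of a made-up four-number word vector. Real models give each head its own learned projections, and the labels "reference" and "grammar" are only illustrative. The point is that the same words produce different attention patterns in different heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors):
    """Attention weights for every word against every word."""
    size = len(vectors[0])
    return [
        softmax([
            sum(q * k for q, k in zip(query, key)) / math.sqrt(size)
            for key in vectors
        ])
        for query in vectors
    ]


def multi_head_weights(vectors, num_heads):
    """Split each vector into equal slices and run one head per slice."""
    size = len(vectors[0])
    if size % num_heads != 0:
        raise ValueError("vector size must divide evenly across heads")
    head_size = size // num_heads
    heads = []
    for h in range(num_heads):
        start, end = h * head_size, (h + 1) * head_size
        sliced = [v[start:end] for v in vectors]
        heads.append(attention_weights(sliced))
    return heads


# Hypothetical vectors: first two numbers feed head 0, last two feed head 1.
words = ["cat", "it", "ran"]
embeddings = [
    [3.0, 0.0, 3.0, 0.0],  # cat
    [1.0, 0.0, 0.0, 1.0],  # it
    [0.0, 1.0, 1.0, 0.0],  # ran
]

heads = multi_head_weights(embeddings, num_heads=2)


def strongest(weight_row):
    """Name the word that receives the largest weight."""
    return words[weight_row.index(max(weight_row))]


reference_link = strongest(heads[0][words.index("it")])
grammar_link = strongest(heads[1][words.index("ran")])
print(f'head 0: "it" looks hardest at "{reference_link}"')
print(f'head 1: "ran" looks hardest at "{grammar_link}"')
```

In this toy setup, one head ties "it" back to "cat" while the other ties "ran" to "cat," and neither head has to do both jobs.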
Word order and structure matter in your prompts. Clear sentences tend to produce cleaner attention patterns and better results.
Attention connects words to context, but it is only one step in the assembly line. What does the full pipeline look like? →