Explaining Transformer to a friend


June 27, 2025 montek.dev


I first learned about Transformers from Vaswani et al.’s 2017 paper and a handful of blog posts, but every time I tried to explain the architecture end-to-end I found myself hunting for diagrams and refresher articles. Here’s my simple, fast, no-jargon walkthrough of how Transformers work, from embeddings to generation, plus the questions that puzzled me along the way. Think of this as a single, uninterrupted explanation you can read in one go, without jumping around.

There is this feeling where you “kinda” understand a topic in depth and can relate to how and why it was made. With the transformer, I'm at this “kinda” phase, where I still don't fully understand the why and the “so what now” step. This blog is a simple explanation of what it is, so that all my friends can understand it. Before moving forward, have this open: https://poloclub.github.io/transformer-explainer , as I find it really good for seeing the transformer work interactively. Also, I recently added a comments section to my blogs, so if you think I wrote something wrong, or want to share your thoughts, do leave a comment!

What is a transformer?

A transformer is basically a sequence-to-sequence model built entirely around attention, ditching the step-by-step processing of RNNs and LSTMs. Every token in your input can “see” every other token at once, making it awesome at capturing long-range relationships and super parallel-friendly on GPUs/TPUs. The original “Attention Is All You Need” paper (Vaswani et al., 2017) stacked six identical encoder layers and six decoder layers and blew older models out of the water on translation tasks.

Word embeddings & subwords

first, you split your sentence into tokens—words or subword pieces—using something like Byte-Pair Encoding (BPE) or a unigram language model. so “unbreakable” might become “un”, “break”, “able,” letting the model handle rare or brand-new words. then each token gets a vector from a lookup table:

# fetch the embedding for token t
embedding_vector = EmbeddingTable[t]

that vector carries semantic info, and we’ll add position info next.

but before that, let's take a practical example of how this word-to-token system works. All the code for these little experiments is on JetBrains Datalore, and it follows the Build a Large Language Model (From Scratch) book.

if you open the notebook you will see all the cells, but here I'm just sticking them into one code block:

import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        # split on punctuation and whitespace, keeping the separators
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

Running this, we get:

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]

so this basically converts the words to token ids from the vocabulary we have set up; if you scroll up in the notebook, the vocab is built from all the words in a txt file. If you try a different word like “Yo”, which is not in the vocab, you will get a KeyError at ids = tokenizer.encode(text), because the encoder does not know where to map “Yo” in the dictionary. For example, 56 is mapped to It, and if you comment out the break statement in the loop below, you will see that 1126 is mapped to you.

for i, item in enumerate(vocab.items()):
    print(item)
    # if i >= 50:
    #     break

As mentioned above, GPT-2 used Byte-Pair Encoding (BPE), and we can use the tiktoken library, which gives access to GPT-2's BPE tokenizer.

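Here's a minimal sketch of what that looks like (assuming the tiktoken package is installed; the exact ids and splits come from GPT-2's learned merges):

import tiktoken

# load the GPT-2 BPE tokenizer that ships with tiktoken
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("unbreakable")
print(ids)                              # subword ids, e.g. a short list of ints
print([enc.decode([i]) for i in ids])   # the subword pieces themselves

# round-trip: decoding the ids gives back the original string
print(enc.decode(ids))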

after that comes the embedding layer. let's take four input examples with token ids 2, 3, 5, and 1 (after tokenization, which we did just above), and a vocab of just 6 words. As for the output dimension, think of it as the “size of the space” where the model represents each token: it's how much room the model has to capture different aspects of meaning, context, and relationships for each token. it's not about the number of outputs, but about how rich or detailed each token's representation can be.

import torch

input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

print(embedding_layer.weight)

and then we get:

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

if we just want the embeddings for token id 3 we can do embedding_layer(torch.tensor([3])), which gives us tensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

Now, what confused me was: why are there 2 extra rows if we only sent 4 tokens? If we look up the embeddings for our 4 token ids we get exactly 4 rows, so what are the 2 extra ones we see when printing the whole weight matrix?

print(embedding_layer(input_ids))

# output:
# tensor([[ 1.2753, -0.2010, -0.1606],
#         [-0.4015,  0.9666, -1.1481],
#         [-2.8400, -0.7849, -1.4096],
#         [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)

So the whole table is based on the vocab, which is all the words the model knows: the embedding matrix has one row per vocab entry (6 here), not one per input token. During training, only the embeddings for the token ids that actually appear in your input batch get updated. so if you send [2, 3, 5, 1], only the rows for 2, 3, 5, and 1 in the embedding matrix will get tweaked by backpropagation. the other embeddings (like the rows for 0 and 4) just sit there unchanged until those token ids show up in some future batch.
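
A quick way to convince yourself of this (a little experiment of mine, not from the notebook): run a backward pass on just those four ids and look at the gradient of the embedding matrix; the rows for the unused ids 0 and 4 stay all zeros.

import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)

input_ids = torch.tensor([2, 3, 5, 1])

# a dummy loss: just sum the looked-up embeddings and backprop
loss = embedding_layer(input_ids).sum()
loss.backward()

# rows 2, 3, 5, 1 have non-zero gradients; rows 0 and 4 are all zeros,
# so an optimizer step would leave those embeddings untouched
print(embedding_layer.weight.grad)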

Positional encoding

attention itself doesn't know order, so we inject positional encodings into each embedding. the classic sinusoids are:

PE(pos, 2i)   = sin(pos / 10000^(2i / d))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d))

where pos is the token index, d is the embedding size, and i indexes pairs of embedding dimensions. you just add these to the embeddings so the model can tell “dog bites man” from “man bites dog.”

There are two types of positional encoding:

  • fixed (sinusoidal) positional embeddings: these are set by a formula and never change during training. they’re just added to the token embeddings so the model knows the order of the tokens.

  • learned positional embeddings: these are trainable parameters, just like the token embeddings. they start out random and get updated during training, so the model can figure out whatever positional patterns help it do the job better.

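If you want to poke at the fixed (sinusoidal) variant in code, here's a small sketch I put together (assuming an even embedding size d_model and a maximum sequence length max_len):

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # pos: (max_len, 1) token positions; i: even embedding-dimension indices
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / (10000 ** (i / d_model))

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions get sin
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions get cos
    return pe

# add to the token embeddings: embeddings + pe[:seq_len]
pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # torch.Size([50, 16])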

Encoder stack


Each encoder layer refines the token vectors in two main steps—self-attention and a feed-forward network (FFN)—with residual skips and layer norms around each (we’ll cover those next). Here’s a walkthrough:

  1. Self-Attention

    • What it does: lets every word “ask” every other word, “How much should I pay attention to you?” and then “listen” accordingly.

    • Under the hood: for each token vector x, we learn three projections: Query (Q), Key (K), and Value (V):

      Q, K, V = x @ W_Q, x @ W_K, x @ W_V

      We score how well each query matches each key, scale, softmax, then mix the values:

      scores = (Q @ K.transpose(-2, -1)) / sqrt(d_k)
      attn_weights = softmax(scores, dim=-1)
      context = attn_weights @ V

    • Why it matters: the resulting context vector for each token is a custom blend of all the other tokens, highlighting the most relevant ones (e.g. linking “dog” with “bites” even if they’re far apart).

  2. Feed-Forward Network (FFN)

    • What it does: processes each token’s context vector independently through a tiny two-layer neural net, adding non-linear “magic.”

    • Under the hood:

      hidden = ReLU(context @ W1 + b1)
      out = hidden @ W2 + b2

    • Why it matters: this MLP step lets the model learn patterns that can’t be captured by just averaging or weighting—think of it as giving each token a chance to remix its own features.

Stack six of these layers (as in Vaswani et al., 2017), and each pass gives you richer, more globally informed token vectors.
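
To make those two steps concrete, here's a minimal single-head sketch in PyTorch (my own illustration with random weights; real transformers use multiple heads, plus the residuals and layer norms we cover next):

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_ff = 5, 16, 64    # 5 tokens, 16-dim embeddings
x = torch.randn(seq_len, d_model)     # token vectors after embedding + position

# --- self-attention (single head) ---
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_model)  # how well each query matches each key
attn_weights = F.softmax(scores, dim=-1)                 # each row sums to 1
context = attn_weights @ V                               # per-token blend of all tokens

# --- feed-forward network, applied to each token independently ---
W1, b1 = torch.randn(d_model, d_ff), torch.zeros(d_ff)
W2, b2 = torch.randn(d_ff, d_model), torch.zeros(d_model)

hidden = F.relu(context @ W1 + b1)
out = hidden @ W2 + b2
print(out.shape)  # torch.Size([5, 16]) -- same shape in, same shape out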

Residual connections & layer norm


after each self-attention or feed-forward step, we don’t just throw away what we started with—we skip back and normalize. here’s the simple workflow:

  1. normalize your input vector x so its scale is nice and stable:

    x_norm = LayerNorm(x)

  2. transform it through your sublayer (self-attention or feed-forward):

    sub_out = sublayer(x_norm)

  3. drop out a few dimensions for regularization:

    sub_out = Dropout(sub_out)

  4. add that back onto your original x:

    x = x + sub_out

that “skip connection” means each layer only has to learn a small tweak, not a totally new representation, and the layer norm keeps values from blowing up or vanishing. This combo is what lets you stack six (or sixty) layers and still train stably.
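
As a small PyTorch sketch of that wrapper pattern (my own illustration, not the exact code from any library):

import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Wraps any sublayer (attention or FFN) in norm -> sublayer -> dropout -> add."""
    def __init__(self, d_model, sublayer, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))

# e.g. wrapping the feed-forward network from the encoder sketch
d_model = 16
ffn = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
block = PreNormResidual(d_model, ffn)
print(block(torch.randn(5, d_model)).shape)  # torch.Size([5, 16])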

Decoder stack


The decoder is like a mirrored encoder, but with three sublayers instead of two:

  1. Masked Self-Attention

    • What it does: lets each generated token look back at all the tokens it’s already produced—but not peek at the future ones.

    • Why mask?: prevents cheating by setting the attention scores for positions beyond the current step to −∞ before the softmax, so future tokens get zero weight (see the mask sketch after the code block below).

  2. Encoder–Decoder Attention

    • What it does: now that each decoder position has its own queries, it “talks” to the encoder’s output vectors (its keys & values).

    • Why it matters: gives the decoder direct access to the full input context. For example, when choosing the next French word, it can focus on the most relevant English tokens.

  3. Feed-Forward Network (FFN)

    • What it does: same tiny two-layer MLP as in the encoder, adding a bit of non-linear flair per token.

Each of these three is wrapped in a pre-norm + residual pattern:

# for each sublayer in [masked-self-attn, cross-attn, ffn]:
x_norm = LayerNorm(x)
sublayer_out = sublayer(x_norm)
sublayer_out = Dropout(sublayer_out)
x = x + sublayer_out
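
And here's roughly how the “don't peek at the future” mask from the masked self-attention step can be built (a sketch; the random scores stand in for Q @ K.T / sqrt(d_k)):

import torch
import torch.nn.functional as F

T = 5                                  # tokens generated so far
scores = torch.randn(T, T)             # raw attention scores (query x key)

# upper triangle (above the diagonal) = future positions -> set to -inf
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

attn_weights = F.softmax(scores, dim=-1)
print(attn_weights)   # row i has zero weight on every position j > i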

Putting it all together

Translating or generating any sequence is just these ordered steps:

  1. Embed & position-encode your input tokens → a batch of vectors.

  2. Run those vectors through all six encoder layers → “memory” vectors (one per input token).

  3. Seed the decoder with a special <start> token.

  4. Repeat until you hit <end> or a length cap:

    • Masked self-attention over your generated tokens so far

    • Encoder–decoder attention over the encoder’s memory vectors

    • Feed-forward + residual + norm

    • Linear projection → softmax → probabilities

    • Pick the next token (greedy or sampled) and append it to the decoder’s input

And that’s it—each pass through the decoder adds exactly one word to the output, always consulting both its own history and the encoder’s understanding of the original sentence.
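
In rough Python, that generation loop looks something like this (a sketch only; encoder, decoder, to_logits, START_ID, and END_ID are hypothetical stand-ins, not a real API):

# hypothetical names: encoder, decoder, to_logits, START_ID, END_ID
memory = encoder(input_ids)                 # steps 1-2: encode once, reuse every step
generated = [START_ID]                      # step 3: seed the decoder

for _ in range(max_len):                    # step 4: one new token per pass
    dec_out = decoder(generated, memory)    # masked self-attn + cross-attn + FFN
    logits = to_logits(dec_out[-1])         # project the last position to vocab size
    next_id = int(logits.argmax())          # greedy pick (or sample, see below)
    generated.append(next_id)
    if next_id == END_ID:
        break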

i had this question too: suppose the task is translation, and we want to translate a 5-word sentence into another language. after the 6 encoder layers, you have 5 vectors (one for each input word), each packed with context about the whole sentence. how does the decoder use these to generate the translation? here's the answer: at each step, the decoder's masked self-attention looks at what it has generated so far, and then the encoder–decoder attention uses those 5 context vectors as keys and values, letting the decoder figure out which parts of the input sentence to focus on when predicting the next word. once it has that context, it goes through the FFN, linear projection, and softmax, and spits out the next token. then it tacks that token onto its input and does it all again.

Sampling & temperature

I also had the question of how to change or vary the transformer's output. if you are viewing the interactive chart, try changing the temperature slider on the left. The temperature divides the logits by T before the softmax: low T (0.1–0.2) is close to greedy, high T (1.0+) is more adventurous. you can also use top-k sampling (sample from the k most likely tokens) or top-p (nucleus) sampling (the smallest set of tokens whose cumulative probability ≥ p) to control randomness.
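
Here's a small sketch of temperature plus top-k sampling applied to a vector of logits (my own example, not tied to the interactive demo):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(50257)          # one score per vocab token (GPT-2-sized vocab)

temperature = 0.7                    # <1.0 sharpens, >1.0 flattens the distribution
k = 50                               # only sample from the 50 most likely tokens

scaled = logits / temperature
topk_vals, topk_ids = torch.topk(scaled, k)
probs = F.softmax(topk_vals, dim=-1)

next_id = topk_ids[torch.multinomial(probs, num_samples=1)]
print(next_id.item())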

Conclusion & group-chat analogy

imagine a big group chat where each “round,” everyone reads every message (self-attention), writes a quick summary of their thoughts (FFN + residual), and passes it on. the decoder is like a smaller group drafting a reply: they read the full original chat and their draft so far, then add one word at a time until they hit “send.” that iterative global listening and summarizing—layer after layer—is what makes Transformers simple and powerful.

That's it! I really hope you got the gist of it, as my motive was to help you understand the meaning, so you get comfortable with the terms and can learn the advanced topics with these smaller examples in mind. For more depth I would recommend reading this: https://jalammar.github.io/illustrated-transformer .

