A Transformer is a deep learning architecture introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al. It revolutionized natural language processing and has since become the backbone of modern AI models like GPT, BERT, and T5.
A Transformer is a neural network architecture designed to process sequential data — most notably text — without relying on recurrence or convolution. Instead, it uses a mechanism called self-attention to weigh the relevance of every token in a sequence against every other token simultaneously. This parallel design made it dramatically faster to train than previous RNN or LSTM-based models. It consists of two main stacks: an Encoder and a Decoder, though many modern models use only one of the two.
Self-attention allows the model to focus on different parts of the input when producing each output token. For each token, three vectors are computed — a Query, a Key, and a Value — derived by multiplying the token's embedding by learned weight matrices. Attention scores are calculated as the scaled dot product of Queries and Keys, then passed through a softmax to produce weights that blend the Value vectors. This lets the model capture long-range dependencies (e.g. linking a pronoun back to its noun) regardless of how far apart they are in the sequence.
Multi-head attention runs several self-attention operations in parallel, each with its own Q/K/V weight matrices, then concatenates their outputs. This lets the model jointly attend to information from different representation subspaces at different positions. Because Transformers process all tokens simultaneously rather than sequentially, they have no built-in sense of order, so positional encodings — fixed or learned vectors added to each token embedding — inject position information. Without positional encodings, the model would treat 'dog bites man' and 'man bites dog' as identical.
The Encoder maps an input sequence into a rich contextual representation; models like BERT use only the encoder and excel at understanding tasks such as classification and question answering. The Decoder generates output tokens one at a time, using both self-attention over previously generated tokens and cross-attention over the encoder's output; models like GPT use only the decoder and excel at generation tasks. Sequence-to-sequence tasks like translation use both stacks together. Each stack is composed of repeated layers containing attention sub-layers and feed-forward sub-layers with residual connections and layer normalization.
Transformers enabled unprecedented scaling: simply adding more parameters and data consistently improved performance, leading to large language models (LLMs) with billions of parameters. They generalize far beyond text — Vision Transformers (ViT) apply them to images, and they are used in protein folding (AlphaFold), audio, and robotics. Pre-training on massive unlabeled corpora followed by fine-tuning on specific tasks became the dominant paradigm in AI. This transfer learning capability means a single pre-trained model can be adapted to dozens of downstream tasks with relatively little data.
The self-attention mechanism has quadratic time and memory complexity relative to sequence length — doubling the sequence length quadruples the compute cost. This makes naive Transformers impractical for very long contexts, motivating techniques like Flash Attention, sparse attention, and sliding window attention. Always use a learning rate warm-up schedule when training Transformers from scratch; they are notoriously sensitive to the learning rate early in training. Gradient clipping and careful weight initialization (e.g. scaled initialization for deep stacks) are also standard best practices to prevent instability.
© RM Full Stack & AI Engineer · All guides · Roadmaps · Open the app