Attention Is All You Need

The paper that found what was waiting to be found

In June 2017, eight researchers at Google published a paper that would, within a few years, reshape the entire landscape of artificial intelligence. The paper was titled "Attention Is All You Need", and it introduced the Transformer -- a neural network architecture that discarded the recurrent computation paradigm that had dominated sequence modeling for nearly three decades. In its place, the authors proposed something that, in retrospect, seems almost inevitable: let the model attend to every part of the input simultaneously, at every layer, with no notion of sequential processing whatsoever.

The result was not incremental. It was a discontinuity. The Transformer achieved state-of-the-art results on machine translation benchmarks while training in a fraction of the time required by previous architectures. More importantly, it opened a design space that nobody had fully anticipated -- one where model performance could scale predictably with compute, data, and parameter count. Everything that followed -- BERT, GPT, PaLM, Claude, Gemini, the entire large language model revolution -- traces its lineage directly to this paper. It is, in hindsight, difficult to imagine a timeline where something like the Transformer was not discovered.

This is an attempt to explain what the paper actually contains, why it mattered, and what it set in motion.

The World Before Transformers

To understand why the Transformer was significant, you need to understand what it replaced. By 2017, the dominant approach to sequence-to-sequence tasks -- machine translation, summarization, language modeling -- was built on recurrent neural networks, specifically Long Short-Term Memory networks (LSTMs) and their close relative, the Gated Recurrent Unit (GRU).

Recurrent neural networks process sequences one element at a time. Given an input sequence of tokens (words, subwords, characters), an RNN maintains a hidden state vector that gets updated at each time step. The hidden state at position t is a function of the hidden state at position t-1 and the current input. This creates an inherently sequential computation: you cannot compute the representation at position 10 without first computing the representations at positions 1 through 9.

LSTMs, introduced by Hochreiter and Schmidhuber in 1997, addressed the vanishing gradient problem that plagued vanilla RNNs by introducing gating mechanisms -- input gates, forget gates, and output gates -- that allowed the network to selectively retain or discard information over long sequences. This was a meaningful improvement, and LSTMs became the workhorse of sequence modeling -- first in speech and handwriting recognition, and, by the mid-2010s, in natural language processing.

The encoder-decoder architecture, popularized by Sutskever, Vinyals, and Le in 2014, gave these recurrent networks a framework for mapping variable-length input sequences to variable-length output sequences. The encoder reads the entire input and compresses it into a fixed-length context vector; the decoder then generates the output sequence conditioned on that vector. This was the foundation of neural machine translation.

The Attention Mechanism (Before the Transformer)

The critical weakness of the basic encoder-decoder was the information bottleneck: the entire input had to be compressed into a single fixed-dimensional vector. For long sequences, this was catastrophic. Bahdanau, Cho, and Bengio proposed the attention mechanism in 2014 to solve exactly this problem. Instead of relying on a single context vector, their approach allowed the decoder to look back at all encoder hidden states and compute a weighted combination of them at each decoding step.

The weights were computed dynamically -- a small alignment network learned which parts of the input were most relevant to each part of the output. This was transformative for translation quality. A model translating "Le chat est sur le tapis" to "The cat is on the mat" could now, while generating "cat," attend strongly to "chat" rather than relying on whatever information survived the compression into a single vector.

Luong et al. followed with simplified variants of attention in 2015, and attention mechanisms quickly became standard. But there was a catch: attention was an addition to recurrent networks, not a replacement. The underlying computation was still sequential. The encoder still had to process the input one token at a time. The decoder still generated output one token at a time. Attention helped the model remember, but it did nothing to address the fundamental computational bottleneck of recurrence.

The Parallelism Problem

This sequential dependency was not merely an architectural inconvenience. It was a hardware bottleneck. GPUs derive their power from massive parallelism -- thousands of cores executing identical operations simultaneously across large tensors. But recurrent networks, by their very nature, serialize computation across the time dimension. You cannot compute the hidden state at step 100 until you have finished steps 1 through 99. This meant that training RNNs on long sequences was slow, and scaling them to larger models and datasets hit diminishing returns in wall-clock time regardless of how much hardware you threw at the problem.

Convolutional approaches to sequence modeling (like ByteNet and ConvS2S) attempted to address the parallelism issue by replacing recurrence with convolutional layers that could process all positions simultaneously. These models were faster to train, but they introduced a different problem: each convolutional layer only sees a local window, so relating two distant positions requires stacking layers until the receptive field spans the gap. The number of operations needed to connect two positions therefore grows with their distance -- linearly for standard convolutions (as in ConvS2S) and logarithmically for dilated convolutions (as in ByteNet).

This was the landscape in mid-2017. Recurrent models were expressive but slow. Convolutional models were fast but struggled with long-range dependencies. The field needed something that was both parallelizable and capable of relating any two positions in a sequence at constant computational cost. The pressure was there. The hardware was there. The mathematical ingredients were all sitting on the table. Someone was going to assemble them.

The Core Innovation: Self-Attention

The central insight of "Attention Is All You Need" is in the title. The authors proposed discarding recurrence and convolution entirely and building the entire model out of attention mechanisms. Not attention as an auxiliary component bolted onto an RNN -- attention as the sole computational primitive.

The specific mechanism they introduced is called scaled dot-product attention. Given a set of queries, keys, and values -- all derived from the same input sequence via learned linear projections -- the attention function computes a weighted sum of the values, where the weight assigned to each value is determined by the compatibility between the corresponding key and the query.

Formally, for a matrix of queries Q, keys K, and values V:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Here, d_k is the dimensionality of the key vectors, and the division by √d_k is a scaling factor that prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients. This scaling detail is easy to overlook, but the authors found it essential for stable training at higher dimensions.
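
In code, the operation is only a few lines. The following NumPy sketch is an illustration of the formula above rather than the paper's implementation; the array shapes and toy inputs are assumptions made here for concreteness.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) pairwise compatibilities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row: a distribution over keys
    return weights @ V                             # weighted sum of value vectors

# Toy example: 4 positions, d_k = d_v = 8 (sizes chosen only for illustration).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)        # shape (4, 8)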

The profound aspect of this formulation is what it implies computationally. Every position in the sequence attends to every other position in a single operation. There is no notion of distance. There is no sequential dependency. Position 1 can attend to position 1000 just as easily as it can attend to position 2. And critically, the entire computation -- the matrix multiplications, the softmax, the weighted sum -- is fully parallelizable across all positions simultaneously. This is what makes the Transformer fundamentally different from everything that came before. It is not a clever approximation of how brains process language. It is something new -- a mechanism that evolution never stumbled upon, but that, once found, turns out to be unreasonably effective.

Queries, Keys, and Values

The query-key-value abstraction deserves careful explanation, because it is the conceptual engine of the entire architecture. The terminology is borrowed from information retrieval. A query is a representation of what a particular position is looking for. A key is a representation of what a particular position has to offer. A value is the actual content that will be retrieved.

In self-attention, all three are derived from the same input. Given an input matrix X (where each row is the embedding of one token), the model computes Q = XW^Q, K = XW^K, and V = XW^V, where W^Q, W^K, and W^V are learned parameter matrices. The separation into three projections gives the model the flexibility to learn different representations for the role of querying, the role of being matched against, and the role of being retrieved. A single token can have a query representation that differs substantially from its key representation.

The dot product QKᵀ produces a matrix of attention scores -- one score for every pair of positions in the sequence. Applying softmax row-wise turns each row into a probability distribution over all positions, indicating how much each position should attend to every other position. Multiplying by V then computes the weighted combination.
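
As a sketch of how this looks in practice -- the projection matrices below are random placeholders for what would be learned parameters, and the dimensions are toy choices:

import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 16            # assumed toy sizes
X = rng.normal(size=(n, d_model))      # one row per token embedding

W_Q = rng.normal(size=(d_model, d_k))  # learned in practice; random stand-ins here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)                              # one score per pair of positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)               # each row: a distribution over positions
output = weights @ V                                         # weighted combination of the values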

Multi-Head Attention

A single attention operation computes one set of attention weights -- one "view" of how the positions in the sequence relate to each other. The authors recognized that this is limiting. In natural language, tokens relate to each other in many simultaneous ways: syntactically, semantically, positionally, referentially. A pronoun needs to attend to its antecedent. A verb needs to attend to its subject. An adjective needs to attend to its noun. These are different relationships that benefit from different learned representations.

Multi-head attention addresses this by running multiple attention operations in parallel, each with its own learned projection matrices. If the model dimension is d_model = 512 and there are h = 8 heads, each head operates on projections of dimension d_k = d_v = d_model / h = 64.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

The outputs of all heads are concatenated and passed through a final linear projection W^O. The total computational cost is comparable to a single attention operation with full dimensionality, because each head operates on a reduced dimension. But the model gains the ability to jointly attend to information from different representation subspaces at different positions.
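
The head-splitting arithmetic is easy to state in code. The sketch below mirrors the paper's base configuration (d_model = 512, h = 8), with random placeholder matrices standing in for learned projections; slicing the columns of one large projection matrix per role is an equivalent and common way to express the per-head matrices.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Split d_model into h heads, attend within each, concatenate, project with W_O."""
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)            # columns owned by head i
        Q, K, V = X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                     # (n, d_k) output of head i
    return np.concatenate(heads, axis=-1) @ W_O       # (n, d_model)

rng = np.random.default_rng(0)
n, d_model = 10, 512
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)       # shape (10, 512)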

In practice, researchers have observed that different heads in trained Transformers learn to specialize: some heads track syntactic dependencies, others attend to adjacent tokens, others capture long-range coreference patterns. The multi-head mechanism is not just a computational trick -- it provides a structured way for the model to decompose the complex relational structure of language into parallel channels of information flow.

Positional Encoding

Self-attention, as described above, has a notable property: it is entirely permutation-invariant. Because the attention operation treats the input as a set of vectors and computes pairwise interactions between all of them, it has no intrinsic notion of order. If you shuffled the tokens in the input sequence, the attention weights would change only because the token embeddings changed, not because the mechanism itself encodes any positional information. For language modeling, where word order is essential to meaning, this is a problem.

The authors solve this by adding positional encodings to the input embeddings. These are vectors of the same dimension as the token embeddings, and they are simply added element-wise before the first layer of the model. The original paper uses sinusoidal functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, forming a geometric progression from 2π to 10000 · 2π. The authors chose sinusoidal encodings for a specific reason: they hypothesized that this would allow the model to learn to attend to relative positions, because for any fixed offset k, the positional encoding at position pos + k can be represented as a linear function of the encoding at position pos. This is a consequence of the additive properties of sine and cosine functions.
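
A short sketch of how the encoding table can be computed; the sequence length and model dimension below are illustrative choices, not values from the paper's experiments.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same argument)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # one wavelength per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# The encoding is simply added to the token embeddings before the first layer:
# X = token_embeddings + pe[:n_tokens]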

The paper notes that learned positional embeddings produced nearly identical results in their experiments. Later work would explore many variations: relative positional encodings (Shaw et al., 2018), rotary position embeddings (Su et al., 2021), and ALiBi (Press et al., 2022). The fact that the specific form of positional encoding is somewhat flexible suggests that the Transformer architecture is robust to this design choice -- what matters is that some positional signal is present, not the exact form it takes.

The Full Architecture

The Transformer follows the encoder-decoder pattern that was already standard for machine translation, but replaces the recurrent layers with stacked self-attention and feed-forward layers.

The Encoder

The encoder consists of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism, and a position-wise fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization:

output = LayerNorm(x + Sublayer(x))

The residual connections are critical. They allow gradients to flow directly through the network without attenuation, enabling the training of deep stacks. Layer normalization stabilizes the activations at each layer. Together, these two techniques are what make it possible to stack six (or, in later models, dozens or hundreds of) layers without training instability.

The feed-forward network is applied identically and independently to each position. It consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

The inner dimension of this feed-forward layer is 2048, four times the model dimension of 512. This expansion-contraction pattern -- projecting to a higher dimension, applying a nonlinearity, then projecting back -- gives each layer a substantial amount of per-position processing capacity. Later research would identify these feed-forward layers as repositories of factual knowledge in the model, functioning almost like key-value memories.
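
Putting the sub-layers together, here is a schematic sketch of a single encoder layer. For brevity it uses one attention head instead of eight, omits the layer norm's learned gain and bias, and uses random placeholder weights -- it illustrates the residual-plus-normalization structure rather than reproducing the paper's exact layer.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector; learned gain/bias omitted for brevity.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(X, p):
    # Sub-layer 1: self-attention, wrapped as LayerNorm(x + Sublayer(x)).
    X = layer_norm(X + self_attention(X, p['W_Q'], p['W_K'], p['W_V']))
    # Sub-layer 2: position-wise feed-forward, with the same residual + norm wrapping.
    X = layer_norm(X + ffn(X, p['W1'], p['b1'], p['W2'], p['b2']))
    return X

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
params = {
    'W_Q': rng.normal(size=(d_model, d_model)) * d_model ** -0.5,
    'W_K': rng.normal(size=(d_model, d_model)) * d_model ** -0.5,
    'W_V': rng.normal(size=(d_model, d_model)) * d_model ** -0.5,
    'W1': rng.normal(size=(d_model, d_ff)) * d_model ** -0.5,
    'b1': np.zeros(d_ff),
    'W2': rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5,
    'b2': np.zeros(d_model),
}
X = rng.normal(size=(n, d_model))
out = encoder_layer(X, params)     # shape (10, 512)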

The Decoder

The decoder is also composed of 6 identical layers, but each layer has three sub-layers instead of two. The first is a masked multi-head self-attention mechanism. The masking ensures that predictions for position i can only depend on known outputs at positions less than i -- this preserves the autoregressive property necessary for generation. The second sub-layer performs multi-head attention over the encoder output, allowing the decoder to attend to all positions in the input sequence. The third is the same position-wise feed-forward network used in the encoder.

The masking in the decoder self-attention is implemented by setting the attention scores of all "illegal" connections (positions in the future) to negative infinity before the softmax, which effectively zeros out those attention weights. This is computationally elegant: the same parallel attention machinery is used, but the causal structure is enforced through a simple mask matrix.
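
A minimal illustration of that trick, with toy scores; the sequence length and random values are assumptions for the example.

import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))   # raw attention scores, toy values

# Positions strictly above the diagonal are "future" positions -- illegal for decoding.
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(causal_mask, -np.inf, scores)          # -inf before the softmax...

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # ...becomes exactly zero weight here

# Row i of `weights` now distributes attention only over positions 0..i.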

The Final Linear Layer and Softmax

The decoder output is passed through a linear transformation followed by a softmax to produce the probability distribution over the output vocabulary for the next token. The authors tie the weights of this output embedding matrix with the input embedding matrix, a technique called weight tying that reduces the total parameter count and was shown by Press and Wolf (2017) to improve performance.

Why It Worked

The Transformer outperformed existing models on the WMT 2014 English-to-German and English-to-French translation benchmarks while requiring significantly less training time. The English-to-French model achieved a BLEU score of 41.0, surpassing all previously published single models at a fraction of their training cost, after training for just 3.5 days on eight GPUs. The reasons for this success are worth examining carefully.

Parallelization

The most immediate advantage is training speed. Because self-attention computes all pairwise interactions in a single matrix multiplication, and because these operations are independent across positions, the entire forward pass of a Transformer layer can be computed in parallel on a GPU. In contrast, an RNN with the same number of parameters would need to process the sequence step-by-step. The paper reports that their base model trained in 12 hours on 8 P100 GPUs. Achieving comparable translation quality with recurrent models took days or weeks.

Constant-Distance Dependencies

In an RNN, information from position 1 must traverse every intermediate hidden state to influence the computation at position 100. Each traversal introduces the possibility of signal degradation. LSTMs mitigate this through gating, but they do not eliminate it. In a Transformer, any position can attend to any other position in a single step. The maximum path length between any two positions is O(1) in terms of self-attention operations, compared to O(n) for recurrent networks and O(log n) for convolutional networks. This makes it fundamentally easier for the model to learn long-range dependencies.

Expressiveness of Attention Patterns

Self-attention computes a rich, data-dependent connectivity pattern at every layer. Unlike convolutions, which apply fixed-size filters, or recurrent connections, which follow a fixed sequential topology, attention weights are computed dynamically based on the content of the sequence. This means the effective computational graph of the model adapts to each input. When processing a sentence with a long-distance dependency, the model can learn to route information directly from the dependent element to where it is needed, without relying on intermediate representations to carry that information forward.

Interpretability

The attention weight matrices provide a degree of interpretability that was largely absent from recurrent models. By examining which positions attend to which other positions, researchers can gain insight into what the model has learned. The paper includes visualizations showing that attention heads learn linguistically meaningful patterns -- certain heads specialize in syntactic dependencies, others track coreference, and others capture phrasal structure. This interpretability, while imperfect, made the Transformer more amenable to analysis and debugging than its predecessors.

Training Details That Mattered

Several training choices in the paper were important to the model's success and became standard practice in later work. The authors used the Adam optimizer with a custom learning rate schedule: the learning rate increases linearly during a warmup phase, then decreases proportionally to the inverse square root of the step number. This "warmup then decay" schedule turned out to be important for stable training of Transformer models and persisted, in various forms, throughout subsequent research.
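
The schedule is compact enough to write down directly. The formula below is the one given in the paper, with its default warmup of 4,000 steps; the sample step numbers are arbitrary.

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for `warmup_steps`, then decay with the inverse square root of the step."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate rises linearly, peaks around step 4000, and decays afterwards.
for s in (1, 1000, 4000, 20000, 100000):
    print(s, transformer_lr(s))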

They applied dropout to the output of each sub-layer (before the residual connection), to the attention weights themselves, and to the sum of token and positional embeddings. The base model used a dropout rate of 0.1. They also employed label smoothing with a value of 0.1, which hurts perplexity slightly but improves BLEU score by encouraging the model to be less confident in its predictions.
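
Label smoothing itself is simple. The sketch below uses one common formulation -- mix the one-hot target with a uniform distribution over the vocabulary before taking the cross-entropy; the vocabulary size and logits are toy values, not the paper's setup.

import numpy as np

def label_smoothed_cross_entropy(logits, target_index, eps=0.1):
    """Cross-entropy against a target that is (1 - eps) one-hot plus eps spread uniformly."""
    vocab_size = logits.shape[-1]
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())        # log-softmax
    smooth_target = np.full(vocab_size, eps / vocab_size)      # uniform mass eps
    smooth_target[target_index] += 1.0 - eps                   # remaining mass on the true token
    return -(smooth_target * log_probs).sum()

logits = np.array([2.0, 0.5, -1.0, 0.1])   # toy scores over a 4-token vocabulary
loss = label_smoothed_cross_entropy(logits, target_index=0, eps=0.1)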

The model was trained with byte pair encoding (BPE) tokenization, using a shared source-target vocabulary of approximately 37,000 tokens. BPE had been introduced for neural machine translation by Sennrich et al. in 2016, and subword tokenization became standard for subsequent language models: GPT and its successors use BPE variants, while BERT adopted the closely related WordPiece scheme.
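
The core of BPE training is short enough to sketch: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new symbol. The toy corpus and merge count below are arbitrary illustrative choices.

import re, collections

def get_stats(vocab):
    """Count the frequency of adjacent symbol pairs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of space-separated characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                      # number of merges chosen arbitrarily here
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)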

What Came After

The impact of "Attention Is All You Need" is difficult to overstate. Within two years of its publication, the Transformer became the dominant architecture not just for machine translation, but for essentially all of natural language processing, and then for computer vision, speech recognition, protein structure prediction, and dozens of other domains. The paper has been cited over 170,000 times as of this writing. But the real impact is measured in the architectures and systems it enabled.

BERT (2018)

Devlin et al. at Google used the encoder portion of the Transformer to create BERT (Bidirectional Encoder Representations from Transformers). The key insight was that pre-training a Transformer encoder on a large corpus using masked language modeling -- randomly masking tokens and training the model to predict them from context -- produced representations that could be fine-tuned for a wide range of downstream tasks. BERT achieved state-of-the-art results on eleven NLP benchmarks simultaneously and fundamentally shifted the field toward pre-train-then-fine-tune paradigms.

GPT and GPT-2 (2018-2019)

OpenAI took the opposite approach: rather than the encoder, they used the decoder portion of the Transformer as a language model. GPT (Generative Pre-trained Transformer) demonstrated that a Transformer trained autoregressively on a large text corpus -- simply predicting the next token -- could then be fine-tuned for downstream tasks. GPT-2 scaled this up to 1.5 billion parameters and showed that the model could generate remarkably coherent long-form text, sparking both excitement and concern about the capabilities of large language models.

The Scaling Laws

Perhaps the most consequential downstream discovery was that Transformer performance scales predictably with three factors: the number of parameters, the amount of training data, and the amount of compute. Kaplan et al. at OpenAI formalized this in their 2020 paper on "Scaling Laws for Neural Language Models," showing that loss follows a power-law relationship with each of these factors across many orders of magnitude.

This was not true, or at least not as cleanly true, for recurrent architectures. RNNs hit diminishing returns much earlier as model size increased, partly because their sequential processing bottleneck made training inefficient, and partly because the recurrent structure itself may have limited the model's ability to exploit additional capacity. The Transformer's fully parallel, position-invariant architecture turned out to be uniquely suited to scaling -- a property that nobody anticipated when the paper was written, but which, once observed, had the feeling of a law that was always there, waiting to be measured.

The scaling laws gave researchers and organizations a roadmap: if you want better performance, train a bigger Transformer on more data with more compute. This led directly to GPT-3 (175 billion parameters), PaLM (540 billion), and the arms race in compute that has defined the AI industry since 2020. Chinchilla (Hoffmann et al., 2022) refined the scaling laws by showing that models were typically undertrained relative to their size, and that optimal performance for a given compute budget requires balancing model size and training data -- but the fundamental insight that Transformers scale remained intact.

Beyond Language

The Vision Transformer (ViT), introduced by Dosovitskiy et al. at Google in 2020, applied the Transformer architecture to image classification by treating an image as a sequence of patches. This achieved competitive performance with convolutional neural networks and, when trained on sufficient data, surpassed them. The success of ViT demonstrated that the Transformer was not a language-specific architecture but a general-purpose sequence (or set) processing engine.

AlphaFold 2 used an attention-based architecture to solve protein structure prediction, one of the grand challenges of biology. DALL-E and Stable Diffusion use Transformers as components in image generation systems. Whisper uses a Transformer for speech recognition. The architecture has become the default substrate for computation over structured data, regardless of modality.

The Deeper Significance

"Attention Is All You Need" is one of those rare papers where the significance extends far beyond the specific results reported. The WMT translation numbers were impressive in 2017 but would be considered modest by current standards. What endured was the architecture -- and, more importantly, the computational paradigm it represented.

The Transformer demonstrated that you do not need inductive biases about sequential structure to process sequential data. You do not need to hardcode the assumption that nearby elements are more related than distant ones. You can let the model learn its own connectivity patterns from data, and if you give it enough data and compute, it will learn patterns that are far richer and more nuanced than anything a human designer would specify.

This is a profound conceptual shift. Earlier architectures were designed around human intuitions about the structure of the problem: sequences are processed sequentially (RNNs), spatial patterns are detected by local filters (CNNs), hierarchical structure requires recursive processing (Tree-LSTMs). The Transformer says: none of that is necessary. Give the model a general-purpose mechanism for relating any element to any other element, and it will figure out the structure on its own. This is less of an invention and more of a discovery -- like finding a simpler equation that was always underneath the more complicated ones.

The success of this approach is, in some sense, a vindication of the "bitter lesson" articulated by Rich Sutton: general methods that leverage computation scale better than methods that incorporate human knowledge about the domain. The Transformer is a remarkably generic architecture. It knows nothing about language, nothing about syntax, nothing about semantics. It is a function from sequences of vectors to sequences of vectors, with learned pairwise interactions between positions. And yet, when trained on enough text, it gives rise to systems that can write poetry, prove theorems, hold coherent conversations, and reason about their own reasoning. The gap between what the architecture is and what it gives rise to is, arguably, the most interesting phenomenon in computer science right now.

The Computational Cost of Universality

Self-attention is not without its costs. The fundamental operation -- computing all pairwise interactions between positions -- scales quadratically with sequence length. For a sequence of length n, the attention mechanism requires O(n²) computation and memory. This was not a significant issue when the paper was published, as machine translation typically involved sequences of a few hundred tokens. But as Transformers were applied to longer and longer contexts -- documents, books, code repositories -- the quadratic cost became a serious constraint.
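
A quick back-of-the-envelope calculation makes the problem concrete. The assumptions here are a naively materialized float32 attention matrix for each of 8 heads in a single layer, ignoring every memory optimization developed since:

n_heads = 8
bytes_per_float = 4
for n in (512, 8_192, 131_072):
    attn_matrix_bytes = n * n * bytes_per_float * n_heads   # one layer's attention weights
    print(f"n = {n:>7}: {attn_matrix_bytes / 2**30:.2f} GiB per layer")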

A substantial body of subsequent research has addressed this limitation. Sparse attention patterns (Child et al., 2019), linear attention (Katharopoulos et al., 2020), FlashAttention (Dao et al., 2022) -- which computes exact attention but reorganizes the computation to avoid materializing the full n × n matrix in memory -- and various approaches to extending context length have all sought to tame the cost of attention over long sequences. More recently, state-space models like Mamba (Gu and Dao, 2023) have revisited the idea of recurrence-like architectures with linear scaling, though they draw heavily on the lessons and training techniques developed for Transformers. The question of whether self-attention can be replaced or approximated for very long contexts remains an active area of research, but the core Transformer architecture has shown remarkable resilience.

The Authors and the Context

The eight authors of the paper -- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin -- were working at Google Brain and Google Research at the time. Several have since gone on to found or co-found significant AI companies. Vaswani and Shazeer co-founded Essential AI and Character.AI respectively (Shazeer later returned to Google). Gomez co-founded Cohere. Polosukhin co-founded NEAR Protocol. The paper's intellectual lineage has thus propagated not just through citations but through institutions.

It is worth noting that the core ideas in the paper did not emerge from a vacuum. Attention mechanisms, as noted above, had been in use since 2014. The decomposition of attention into queries, keys, and values drew on ideas from content-addressable memory systems. The idea of using attention without recurrence had been explored in contemporaneous work. What Vaswani et al. achieved was the synthesis: the specific combination of multi-head attention, positional encoding, residual connections, layer normalization, and the encoder-decoder structure, together with the demonstration that this combination could outperform much more complex systems. The engineering and experimental rigor of the paper -- the training recipes, the ablation studies, the careful reporting of results -- was as important as the architectural innovation.

Reading the Paper Today

"Attention Is All You Need" is remarkably concise. The core architecture is described in about seven pages. The writing is clear, the notation is clean, and the paper does not overstate its claims. The abstract says, simply: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." There is no hyperbole, no grand proclamation about the future of AI. The authors knew they had built something effective for machine translation. They could not have known they had built the substrate for a technological revolution.

The paper's Table 1, which compares the maximum path length, computational complexity, and sequential operations across different architectures, is perhaps the single most important table in modern AI research. It lays out, in stark quantitative terms, why self-attention is the right primitive: O(1) maximum path length for relating any two positions, O(n² · d) total computation (which, for typical sequence lengths and model dimensions, is dominated by the matrix multiplications that GPUs excel at), and O(1) sequential operations.

The ablation studies in Section 6.2 (Table 3) are also illuminating. They show that the number of attention heads matters (8 heads outperform 1 head), that the attention key size matters, that bigger models perform better, and that dropout is important. These findings, modest as they seem, established the design space that subsequent researchers would explore for years.

The Legacy

If you are interacting with a large language model today -- asking it questions, having it write code, requesting it to analyze data -- you are using a direct descendant of the architecture described in this paper. The intervening years have brought many improvements: better tokenization, better training techniques, RLHF for alignment, mixture-of-experts architectures for efficiency, and vastly larger scale. But the core computational primitive -- multi-head self-attention over a sequence of learned representations -- remains essentially unchanged from what Vaswani et al. described in 2017.

This is extraordinary. In a field that moves as fast as machine learning, for a single architecture to remain the foundation of the entire field nine years after its introduction is almost unprecedented. Convolutional neural networks dominated computer vision for roughly a decade before being supplanted by Transformer-based vision models. LSTMs were the leading approach to sequence modeling for much of the two decades before 2017. The Transformer may prove equally long-lived, or it may be superseded. But whatever comes next will be built by researchers who think in the language the Transformer taught them: attention, residual streams, layer norms, and the scaling laws that emerge from their combination.

The paper ends with a characteristically understated final sentence: "We are excited about the future of attention-based models and plan to apply them to other tasks." They did. And so did everyone else. And the thing they set in motion has not stopped accelerating.