A Transformer is a deep learning model architecture that eschews recurrent or convolutional layers in favor of a self-attention mechanism. This mechanism allows the model to weigh the importance of all elements in an input sequence simultaneously when processing any single element, enabling it to capture complex, long-range contextual relationships. The architecture's parallelizable nature, stemming from its lack of sequential dependencies, allows for efficient training on modern hardware accelerators like GPUs and TPUs.
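The attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention for a single head, not a full Transformer layer; the dimensions, weight matrices, and function names here are hypothetical choices for the example, and the random inputs stand in for learned parameters and real token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    # Every position scores every other position in one matrix product,
    # which is what makes the computation fully parallelizable
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row is a distribution over positions
    return weights @ V                        # weighted sum of values, (seq_len, d_k)

# Hypothetical sizes for demonstration only
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that each output row depends on the entire input sequence through the attention weights, which is how long-range relationships are captured without any recurrence.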
