The transformer architecture was developed at Google in 2017 for use in a translation model.

It is a sequence-to-sequence model that converts sequences from one domain into sequences in another, for example translating French sentences into English ones. The original transformer architecture consists of two parts: an encoder and a decoder. The encoder converts the input text (e.g., a French sentence) into a representation, which is then passed to the decoder. The decoder uses this representation to generate the output text (e.g., an English translation) autoregressively, one token at a time.

Notably, the size of the transformer encoder’s output is linear in the size of its input: the encoder emits one vector per input token.

The transformer consists of multiple layers. A layer in a neural network comprises a set of parameters that perform a specific transformation on the data.

Some of the layers used in a transformer architecture are the following:

Multi-Head Attention
Add & Norm
Feed-Forward
Linear
Softmax

The layers can be sub-divided into the input, hidden and output layers. The input layer (e.g., Input/Output Embedding) is the layer where the raw data enters the network. Input embeddings are used to represent the input tokens to the model. Output embeddings are used to represent the output tokens that the model predicts.

Embeddings

To prepare language inputs for transformers, we convert an input sequence into tokens and then into input embeddings. At a high level, an input embedding is a high-dimensional vector that represents the meaning of each token in the sentence. This embedding is then fed into the transformer for processing.

Generating an input embedding involves the following steps:

Normalization (Optional): Standardizes text by removing redundant whitespace, accents, etc.
Tokenization: Breaks the sentence into words or subwords and maps them to integer token IDs from a vocabulary.
Embedding: Converts each token ID to its corresponding high-dimensional vector, typically using a lookup table. These can be learned during the training process.
Positional Encoding: Adds information about the position of each token in the sequence to help the transformer understand word order.

These steps help to prepare the input for the transformers so that they can better understand the meaning of the text.
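
The four steps can be sketched end to end in a few lines of NumPy. This is a minimal illustration, not how any particular model does it: the vocabulary, the whitespace tokenizer, the embedding width and the random lookup table are all assumptions made for the example, and the positional encoding shown is the sinusoidal variant (learned positional embeddings are another common choice).

```python
import re
import unicodedata

import numpy as np

# Hypothetical toy vocabulary; real models learn subword vocabularies.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
D_MODEL = 8  # embedding width, chosen arbitrarily (must be even here)


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse redundant whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents).strip()


def tokenize(text: str) -> list[int]:
    """Split on spaces and map each word to its integer token ID."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split(" ")]


def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: sine on even columns, cosine on odd columns."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding


rng = np.random.default_rng(0)
# Lookup table with one row per token ID; random here, learned in a real model.
embedding_table = rng.normal(size=(len(VOCAB), D_MODEL))

token_ids = tokenize(normalize("  The  cát sat on the mat "))
token_vectors = embedding_table[token_ids]  # embedding lookup
input_embeddings = token_vectors + sinusoidal_positions(len(token_ids), D_MODEL)
```

The result has one row per token, and the same word at two positions (here "the") gets the same token vector but a different final embedding because of the added position signal.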

In recommendation ranking, transformer blocks can consume non-text sequences too. Recent Pin actions, query events, board interactions, and candidate entities can first become categorical feature embeddings, then pass through attention and feed-forward layers. See Recommendation Ranking Models and Embedding-Heavy Models for the recommender-specific model shape.
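
As a rough sketch of what "categorical feature embeddings" could look like for a non-text sequence: each event is described by categorical IDs, each ID indexes its own lookup table, and the looked-up vectors are combined into one vector per event. The action names, the entity ID space, the table sizes and the choice to sum the vectors are all hypothetical and not taken from any specific ranking model.

```python
import numpy as np

# Hypothetical categorical vocabularies for a user-activity sequence.
ACTION_TYPES = {"view": 0, "save": 1, "click": 2, "search": 3}
NUM_ENTITIES = 1000  # assumed size of the entity ID space
D_MODEL = 8

rng = np.random.default_rng(0)
# Random tables stand in for embeddings that would be learned in training.
action_table = rng.normal(size=(len(ACTION_TYPES), D_MODEL))
entity_table = rng.normal(size=(NUM_ENTITIES, D_MODEL))

# Each event is (action type, entity ID), ordered oldest to newest.
events = [("view", 17), ("save", 17), ("search", 402), ("click", 88)]
action_ids = [ACTION_TYPES[action] for action, _ in events]
entity_ids = [entity_id for _, entity_id in events]

# One vector per event: the sum of its categorical feature embeddings.
event_embeddings = action_table[action_ids] + entity_table[entity_ids]
```

The resulting matrix has the same shape as a sequence of token embeddings, so the attention and feed-forward layers can process it unchanged.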

Layers in a transformer

Each transformer layer is composed of:

A multi-head attention module
A layer normalization (Add and Norm)
A feedforward layer
Another layer normalization (Add and Norm)

Layer normalization (Add and Norm)

Each layer in a transformer, consisting of a multi-head attention module and a feed-forward layer, employs layer normalization and residual connections. An Add and Norm step follows both the multi-head attention module and the feedforward layer: it adds the sub-layer’s input to its output (the residual connection) and then normalizes the sum.
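
The wiring of one layer can be sketched as follows. This assumes the "post-norm" ordering used in the original architecture, where normalization is applied after the residual addition. The two sub-layers are passed in as plain functions and the ones used at the bottom are stand-ins, not real attention or feed-forward modules. The normalization omits the learned scale and shift for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transformer_layer(x, attention, feed_forward):
    """Post-norm layer: each sub-layer is wrapped as LayerNorm(x + sublayer(x))."""
    x = layer_norm(x + attention(x))      # multi-head attention, then Add & Norm
    x = layer_norm(x + feed_forward(x))   # feed-forward, then Add & Norm
    return x


# Stand-in sub-layers (hypothetical): any shape-preserving function works here.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
tokens = rng.normal(size=(3, 4))  # 3 tokens, model width 4
out = transformer_layer(tokens, attention=lambda h: h @ w, feed_forward=np.tanh)
```

Because every sub-layer preserves the input shape, layers built this way can be stacked one after another.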

Layer normalization

Layer normalization computes the mean and variance of the activations within a layer, separately for each token, and uses them to normalize those activations. This is typically done to reduce covariate shift and improve gradient flow, which yields faster convergence during training and better overall performance.
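
A minimal sketch of the computation, assuming the usual formulation with a learned per-feature scale (`gamma`) and shift (`beta`); the parameter names and the `eps` value are conventions, not something this note specifies.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis of each token, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)   # one mean per token
    var = x.var(axis=-1, keepdims=True)     # one variance per token
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


d_model = 4
gamma, beta = np.ones(d_model), np.zeros(d_model)  # typical initial values

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
normed = layer_norm(x, gamma, beta)
```

The second row is ten times the first, yet both rows normalize to almost the same vector, which shows how the operation removes differences in scale between tokens.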

Residual connections

Residual connections propagate the inputs of one or more layers directly to their output. This makes the optimization problem easier and also helps deal with vanishing and exploding gradients.

Residual connection

Residual in this context means the “leftover” information that isn’t fully captured by the layer and its output. It is a connection because it bypasses the layer function and is added to its output.
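
In code, a residual connection is a single addition. The helper name below is made up for illustration.

```python
import numpy as np


def with_residual(layer_fn, x):
    """Add the layer's input back onto its output: y = x + f(x)."""
    return x + layer_fn(x)


x = np.array([1.0, -2.0, 3.0])

# A layer that outputs zeros leaves the input untouched, so the block
# behaves as the identity and only has to learn a correction to x.
unchanged = with_residual(lambda h: np.zeros_like(h), x)
shifted = with_residual(lambda h: 0.1 * h, x)
```

Since the input always reaches the output through the addition, gradients also have a direct path backwards that does not pass through the layer function.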

Feedforward layer

Feedforward layers are common building blocks of deep neural networks:

The input is multiplied by weights and biases are added
A non-linear activation function such as ReLU or GELU is applied

Tip

A non-linear activation is required: with only linear functions, even a deep network would collapse into a single linear transformation that can only capture linear relationships.
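
The collapse can be checked numerically. In this small sketch (shapes and random weights are arbitrary), two stacked linear layers are reproduced exactly by one layer with merged weights, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between...
two_layers = (x @ w1 + b1) @ w2 + b2

# ...equal one linear layer with merged weights and bias.
w_merged = w1 @ w2
b_merged = b1 @ w2 + b2
one_layer = x @ w_merged + b_merged

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2
```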

Because the feedforward layer in a transformer is applied independently to each token, it can enrich each token’s representation after attention by adding complexity and non-linearity.
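
A sketch of a position-wise feedforward block, assuming the common two-layer form (expand, activate, project back) with ReLU; the widths and random weights are illustrative only.

```python
import numpy as np


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply ReLU, project back to the model width."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_hidden = 4, 16  # the hidden width is often about 4x the model width
w1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

tokens = rng.normal(size=(3, d_model))
all_at_once = feed_forward(tokens, w1, b1, w2, b2)
# Running a single token alone gives the same row: tokens do not interact here.
one_token = feed_forward(tokens[1:2], w1, b1, w2, b2)
```

Mixing information between tokens is left entirely to the attention sub-layer; this block only transforms each token's vector on its own.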

Encoder and decoder

The original transformer architecture relies on a combination of encoder and decoder modules. Each encoder and decoder consists of a series of layers, with each layer comprising key components:

a multi-head self-attention mechanism
a position-wise feed-forward network
normalization layers
residual connections

Encoder

The encoder’s primary function is to process the input sequence into a continuous representation that holds contextual information for each token. The input sequence is first normalized, tokenized, and converted into embeddings. Positional encodings are added to these embeddings to retain sequence order information. Through self-attention mechanisms, each token in the sequence can dynamically attend to any other token, thus understanding the contextual relationships within the sequence. The output from the encoder is a series of embedding vectors Z representing the entire input sequence.
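
The self-attention step can be sketched as scaled dot-product attention. For brevity this uses a single head rather than the multi-head version, and the projection matrices are random stand-ins for learned weights.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores)                # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

z, weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `weights` says how much token `i` draws from every other token, and `z` contains one output vector per input token.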

Decoder

The decoder is tasked with generating an output sequence based on the context provided by the encoder’s output Z. It operates in a token-by-token fashion, beginning with a start-of-sequence token. The decoder layers employ two types of attention mechanisms:

masked self-attention
encoder-decoder cross-attention.

The majority of recent LLMs have adopted a decoder-only variant of the transformer architecture. This approach forgoes the traditional encoder-decoder separation, focusing instead on directly generating the output sequence from the input. Effectively, the specific tasks for encoding and decoding are merged:

The input sequence undergoes a similar process of embedding and positional encoding before being fed into the decoder.
The decoder then uses masked self-attention to generate predictions for each subsequent token based on the previously generated tokens.
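
The token-by-token generation loop can be sketched with greedy decoding, which always picks the highest-scoring next token (sampling strategies are a common alternative). The "model" here is a toy function invented for the example, as are the token IDs and vocabulary size; a real decoder would compute the logits with its stacked layers.

```python
import numpy as np

EOS_ID = 0       # hypothetical end-of-sequence token ID
VOCAB_SIZE = 6


def toy_next_token_logits(token_ids):
    """Stand-in for a trained decoder: favors (last token + 1), wrapping to EOS."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits


def generate(prompt_ids, next_token_logits, max_new_tokens=10):
    """Greedy autoregressive decoding: append the top-scoring token until EOS."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(next_token_logits(token_ids)))
        token_ids.append(next_id)
        if next_id == EOS_ID:
            break
    return token_ids


output_ids = generate([2, 3], toy_next_token_logits)
```

Each new token is appended to the sequence and fed back in on the next step, so the prompt and the generated text share a single growing context.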

Masked self attention

Masked self-attention ensures that each position can only attend to earlier positions in the output sequence, preserving the autoregressive property. This is crucial for preventing the decoder from having access to future tokens in the output sequence.
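
One common way to implement the mask, sketched here under the assumption that attention scores have already been computed, is to set the scores for future positions to negative infinity before the softmax so that their weights become exactly zero. Note that in this sketch each position may attend to itself as well as to earlier positions.

```python
import numpy as np


def causal_mask(seq_len):
    """Boolean matrix: True where position i may attend to position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_softmax(scores, mask):
    """Softmax over each row after setting disallowed positions to -inf."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # raw query-key scores for 4 positions
weights = masked_softmax(scores, causal_mask(4))
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, and the last position can attend to everything before it.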

Encoder-decoder cross-attention

The encoder-decoder cross-attention mechanism allows the decoder to focus on relevant parts of the input sequence, utilizing the contextual embeddings generated by the encoder. This iterative process continues until the decoder predicts an end-of-sequence token, thereby completing the output sequence generation.
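
Cross-attention differs from self-attention only in where its inputs come from. In this single-head sketch with random stand-in weights, the queries are computed from the decoder's states while the keys and values are computed from the encoder output Z.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    q = decoder_states @ w_q
    k = encoder_output @ w_k
    v = encoder_output @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (target_len, source_len)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model = 8
z = rng.normal(size=(6, d_model))               # encoder output, 6 source tokens
decoder_states = rng.normal(size=(3, d_model))  # 3 target tokens generated so far
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, z, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the source tokens, which is how a target token "looks at" the relevant parts of the input. No causal mask is needed here because the whole input sequence is already known.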

Context length

The context length is the maximum number of previous tokens the model can ‘remember’ and use to predict the next token in the sequence.

Longer context lengths allow the model to capture more complex relationships and dependencies within the text, potentially leading to better performance. However, longer contexts also require more computational resources and memory, which can slow down training and inference. Choosing an appropriate context length involves balancing these trade-offs based on the specific task and available resources.
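
A back-of-the-envelope sketch of the trade-off, assuming standard full self-attention in which every token is scored against every other token: the score matrix grows with the square of the context length, and anything older than the window has to be dropped. The function names are made up for the example.

```python
def attention_score_entries(context_length: int) -> int:
    """Entries in one attention score matrix: every token scores every token."""
    return context_length * context_length


def truncate_to_context(token_ids: list[int], context_length: int) -> list[int]:
    """Keep only the most recent tokens that fit in the context window."""
    return token_ids[max(0, len(token_ids) - context_length):]


# Doubling the context length quadruples the number of score entries.
costs = {n: attention_score_entries(n) for n in (1024, 2048, 4096)}

history = list(range(10))
visible = truncate_to_context(history, context_length=4)
```

With a window of 4, only the last four tokens of the ten-token history remain visible to the model.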

Popular models

Model	Developer	Release Date	Architecture	Key Features	Key Differences
GPT-2	OpenAI	February 2019	Decoder-only Transformer	1.5B parameters; first widely available large language model; generated realistic text	Popularized large-scale text generation; early model with few-shot capabilities
GPT-3	OpenAI	June 2020	Decoder-only Transformer	175B parameters; few-shot learning; API access enabled a wide range of applications	High parameter count; versatile for tasks from summarization to creative text generation
GPT-4	OpenAI	March 2023	Decoder-only Transformer	Multimodal (text and image); improved reasoning capabilities	Enhanced multimodality and interpretive abilities, better handling of complex queries
LaMDA	Google	May 2021	Decoder-only Transformer	137B parameters; optimized for dialogue; trained on conversational data	Specializes in open-ended conversation and maintaining coherent, contextual responses
Chinchilla	DeepMind	March 2022	Decoder-only Transformer	Optimized for efficiency; focuses on smaller parameters with more training tokens (70B params)	Demonstrates that model performance can improve with optimal parameter-to-data scaling
PaLM	Google	April 2022	Decoder-only Transformer	540B parameters; extensive multilingual support; high reasoning capabilities	One of the largest language models; designed for tasks requiring large context and deep reasoning
PaLM 2	Google	May 2023	Decoder-only Transformer	Smaller models (4 sizes); multilingual, coding, and medical versions	More efficient than PaLM; specialized variants for domain-specific applications
Gopher	DeepMind	December 2021	Decoder-only Transformer	280B parameters; designed for a wide range of NLP tasks	Extensive evaluation on knowledge, reasoning, and reading comprehension benchmarks
GLaM	Google	December 2021	Mixture of Experts	1.2T total parameters, but only a subset active per input; efficient and scalable	Mixture-of-experts approach enables high capacity with reduced computation per input
Mistral	Mistral AI	September 2023	Decoder-only Transformer	7B parameters; fully open-source; optimized for inference efficiency	Smaller size with performance optimized for real-world deployment and efficiency
Qwen	Alibaba Cloud	August 2023	Decoder-only Transformer	Open-weight models starting at 7B parameters; trained on multilingual data with an emphasis on Chinese and English	Strong Chinese-English bilingual coverage; later releases added larger sizes and chat-tuned variants
LLaMA	Meta	February 2023	Decoder-only Transformer	Open to academic researchers; efficient performance on smaller hardware	Designed for high performance with fewer resources, targeted at research
LLaMA 2	Meta	July 2023	Decoder-only Transformer	7B, 13B, 70B parameters; weights openly released under a custom license that permits commercial use; includes fine-tuned chat variants	Openly available; optimized for fine-tuning and generalization across applications
Gemini 1	Google DeepMind	December 2023	Multimodal Transformer	Advanced language and visual reasoning capabilities, integrating DeepMind’s reinforcement learning expertise	Combines reinforcement learning expertise with language and vision for complex reasoning

OpenAI: GPT models evolved from GPT-2’s 1.5B parameters to GPT-4’s multimodal capabilities. GPT-3 and GPT-4 set benchmarks in few-shot learning and text generation. Grok, which targets real-time social media, comes from xAI rather than OpenAI.
Google (LaMDA, PaLM, GLaM): Models span conversational dialogue (LaMDA), large-scale multilingual capabilities (PaLM), and efficiency-focused architectures (GLaM’s mixture of experts). PaLM 2 introduced smaller, domain-optimized variants.
DeepMind (Chinchilla, Gopher): Gopher and Chinchilla focus on optimizing model scaling. Chinchilla’s high data-to-parameter ratio emphasizes that well-balanced scaling can improve performance even with fewer parameters.
Meta (LLaMA): The LLaMA series is known for efficiency, small hardware requirements, and open access for research. LLaMA 2 made its weights openly available for broader applications, including commercial use.
Mistral: Known for releasing an efficient, open-source 7B parameter model, designed to be highly effective with lower computational demands.

The illustrated transformer

The Illustrated Transformer by Jay Alammar is an excellent guide for understanding the architecture in more depth. It is available here: The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
