The transformer architecture was developed at Google in 2017 for use in a translation model.
It is a sequence-to-sequence model that converts sequences from one domain into sequences in another, for example translating French sentences into English. The original transformer architecture consists of two parts: an encoder and a decoder. The encoder converts the input text (e.g., a French sentence) into a representation, which is then passed to the decoder. The decoder uses this representation to generate the output text (e.g., an English translation) autoregressively, one token at a time.

Notably, the size of the output of the transformer encoder is linear in the size of its input.
The transformer consists of multiple layers. A layer in a neural network comprises a set of parameters that perform a specific transformation on the data.
Some of the layers used in a transformer architecture are the following:
- Multi-Head Attention
- Add & Norm
- Feed-Forward
- Linear
- Softmax
The layers can be sub-divided into the input, hidden and output layers. The input layer (e.g., Input/Output Embedding) is the layer where the raw data enters the network. Input embeddings are used to represent the input tokens to the model. Output embeddings are used to represent the output tokens that the model predicts.
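The Linear and Softmax layers listed above form the output stage: a linear projection maps the final hidden vector to one score per vocabulary entry, and softmax turns those scores into probabilities. The following is a minimal sketch in plain Python; the hidden vector, the weight matrix `W`, and the three-token vocabulary are toy values assumed for illustration.

```python
import math


def linear(x, weights, bias):
    """Compute x @ W + b for one vector x; weights has shape (len(x), n_out)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [1.0, -1.0]                      # final hidden vector for one position (toy values)
W = [[2.0, 0.0, -1.0], [0.0, 1.0, 1.0]]   # projects 2 dimensions onto a 3-token vocabulary
b = [0.0, 0.0, 0.0]

logits = linear(hidden, W, b)            # one raw score per vocabulary entry
probs = softmax(logits)                  # probability of each token being next
predicted_id = probs.index(max(probs))   # greedy choice: the most likely token ID
```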
Embeddings
To prepare language inputs for transformers, we convert an input sequence into tokens and then into input embeddings. At a high level, an input embedding is a high-dimensional vector that represents the meaning of each token in the sentence. This embedding is then fed into the transformer for processing.
Generating an input embedding involves the following steps:
- Normalization (Optional): Standardizes text by removing redundant whitespace, accents, etc.
- Tokenization: Breaks the sentence into words or subwords and maps them to integer token IDs from a vocabulary.
- Embedding: Converts each token ID to its corresponding high-dimensional vector, typically using a lookup table. The vectors in this table are usually learned during training.
- Positional Encoding: Adds information about the position of each token in the sequence to help the transformer understand word order.
These steps prepare the input so that the transformer can better capture the meaning of the text.
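The four steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than a production tokenizer: the vocabulary, the whitespace tokenizer, the lowercasing step, the embedding size `D_MODEL = 4`, and the randomly initialized embedding table are all assumptions made for the example. The sinusoidal positional encoding follows the formulation used in the original transformer paper.

```python
import math
import random
import unicodedata

D_MODEL = 4  # embedding size (assumed; real models use hundreds or thousands)
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}  # toy vocabulary (assumed)

random.seed(0)
# Lookup table: one vector per token ID. In a real model these are learned.
EMBEDDING_TABLE = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]


def normalize(text):
    """Lowercase, strip accents, and collapse redundant whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def tokenize(text):
    """Split on whitespace and map each word to an integer token ID."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split()]


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def embed(text):
    """Run the full pipeline and return the token IDs and one vector per token."""
    token_ids = tokenize(normalize(text))
    vectors = []
    for position, token_id in enumerate(token_ids):
        pos_enc = positional_encoding(position, D_MODEL)
        vectors.append([e + p for e, p in zip(EMBEDDING_TABLE[token_id], pos_enc)])
    return token_ids, vectors


token_ids, vectors = embed("  The   cát sat ")
```

Real systems replace the whitespace split with a subword tokenizer, but the shape of the result is the same: one vector of size `D_MODEL` per token.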
Layers in a transformer
Each transformer layer is composed of:
- Multi-head attention
- Layer normalization (Add and Norm)
- A feedforward layer
- Another layer normalization (Add and Norm)
Layer normalization (Add and Norm)
Each layer in a transformer, consisting of a multi-head attention module and a feed-forward layer, employs layer normalization and residual connections. The Add and Norm step is applied after both the multi-head attention module and the feedforward layer: it adds the sublayer's input to its output (the residual connection) and then normalizes the result.
Layer normalization
Layer normalization computes the mean and variance of the activations in a given layer and uses them to normalize those activations. This is typically done to reduce covariate shift and improve gradient flow, which yields faster convergence during training and better overall performance.
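As a minimal sketch of the computation described above, the function below normalizes a single token's activation vector. The names `gamma`, `beta`, and `eps` follow common convention but are assumptions here; in a real model `gamma` and `beta` are learned parameters.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one activation vector to roughly zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    # eps avoids division by zero when all activations are equal.
    normalized = [(v - mean) / math.sqrt(variance + eps) for v in x]
    # gamma (scale) and beta (shift) default to the identity transformation.
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * v + b for g, v, b in zip(gamma, normalized, beta)]


activations = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(activations)
```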
Residual connections
Residual connections propagate the inputs to the output of one or more layers. This makes the network easier to optimize and helps mitigate vanishing and exploding gradients.
Residual connection
Residual in this context means the “leftover” information that isn’t fully captured by the layer and its output. It is a connection because it bypasses the layer function and is added to its output.
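In code, a residual connection is simply `x + sublayer(x)`. The sketch below uses two hypothetical stand-in sublayers, `halve` and `zero`, in place of real attention or feedforward modules.

```python
def add_residual(x, sublayer):
    """Return x + sublayer(x): the input bypasses the sublayer and is added back."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]


def halve(x):
    """Hypothetical stand-in for an attention or feedforward sublayer."""
    return [0.5 * v for v in x]


def zero(x):
    """A sublayer that contributes nothing."""
    return [0.0 for _ in x]


x = [1.0, 2.0, 3.0]
out = add_residual(x, halve)       # sublayer output is added on top of the input
unchanged = add_residual(x, zero)  # the input passes through untouched
```

Even when a sublayer contributes nothing, the input still reaches the output through the skip path, which is one intuition for why gradients propagate more easily through deep stacks.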
Feedforward layer
Feedforward layers are a common building block of deep neural networks:
- The input is multiplied by weights and biases are added
- A non-linear activation function such as ReLU or GELU is applied
Tip
A non-linear activation is required because, with only linear functions, even a deep network would collapse into a single linear transformation that can only capture linear relationships.
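A small worked example makes this collapse concrete. The 2x2 matrices `W1` and `W2` below are arbitrary toy values: two stacked linear layers give exactly the same result as one layer whose weights are the product of the two, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(vector):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in vector]


W1 = [[1.0, -1.0], [0.0, 2.0]]
W2 = [[2.0, 1.0], [1.0, 0.0]]
x = [1.0, 3.0]

# Two stacked linear layers...
stacked = matvec(W2, matvec(W1, x))
# ...equal a single linear layer whose weight matrix is W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU in between, the result no longer matches the single linear layer.
with_relu = matvec(W2, relu(matvec(W1, x)))
```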
Because the feedforward layer in a transformer is applied independently to each token, it can enrich each token's representation after attention by adding complexity and non-linearity.
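The sketch below illustrates the "applied independently to each token" part. It assumes the two-layer form used in the original transformer (expand, apply the activation, project back); the weights, the model size of 2, and the hidden size of 3 are toy values chosen for the example.

```python
def linear(x, weights, bias):
    """Compute x @ W + b for one token vector; weights has shape (len(x), n_out)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def feed_forward(token, w1, b1, w2, b2):
    """Expand, apply the non-linearity, then project back to the model size."""
    return linear(relu(linear(token, w1, b1)), w2, b2)


# Toy weights (assumed): model size 2, hidden size 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B2 = [0.1, -0.1]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
# The same weights are applied to every token, with no mixing between positions.
outputs = [feed_forward(token, W1, B1, W2, B2) for token in sequence]
```

The first and third tokens are identical, so their outputs are identical too: unlike attention, this layer never looks at neighboring positions.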
Encoder and decoder
The original transformer architecture relies on a combination of encoder and decoder modules. Each encoder and decoder consists of a series of layers, with each layer comprising key components:
- a multi-head self-attention mechanism
- a position-wise feed-forward network
- normalization layers
- residual connections.
Encoder
The encoder’s primary function is to process the input sequence into a continuous representation that holds contextual information for each token. The input sequence is first normalized, tokenized, and converted into embeddings. Positional encodings are added to these embeddings to retain sequence order information. Through self-attention mechanisms, each token in the sequence can dynamically attend to any other token, thus understanding the contextual relationships within the sequence. The output from the encoder is a series of embedding vectors Z representing the entire input sequence.
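To make the self-attention step concrete, here is a minimal single-head sketch in plain Python. It is deliberately simplified: real layers first project the inputs with learned matrices to obtain queries, keys, and values and use several heads, whereas this example assumes Q = K = V = x. The token vectors are toy values.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x (simplified)."""
    d = len(x[0])
    outputs, all_weights = [], []
    for query in x:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)  # how much this token attends to each token
        # Weighted average of all token vectors.
        mixed = [sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
Z, attention = self_attention(tokens)
```

`Z` contains exactly one vector per input token, so the encoder output grows linearly with the input length.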
Decoder
The decoder is tasked with generating an output sequence based on the context provided by the encoder’s output Z. It operates in a token-by-token fashion, beginning with a start-of-sequence token. The decoder layers employ two types of attention mechanisms:
- masked self-attention
- encoder-decoder cross-attention.
The majority of recent LLMs have adopted a decoder-only variant of the transformer architecture. This approach forgoes the traditional encoder-decoder separation, focusing instead on directly generating the output sequence from the input. Effectively, the specific tasks for encoding and decoding are merged:
- The input sequence undergoes a similar process of embedding and positional encoding before being fed into the decoder.
- The decoder then uses masked self-attention to generate predictions for each subsequent token based on the previously generated tokens.
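The token-by-token generation described above boils down to a simple loop. In the sketch below, `toy_next_token` and its bigram table are hypothetical stand-ins for a trained decoder; a real model would run masked self-attention over all previous tokens to pick the next one.

```python
EOS = "<eos>"

# Hypothetical stand-in for a trained decoder: maps the last token to the next one.
TOY_BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": EOS}


def toy_next_token(tokens):
    """Predict the next token from the tokens generated so far (toy rule)."""
    return TOY_BIGRAMS.get(tokens[-1], EOS)


def generate(prompt_tokens, max_new_tokens=10):
    """Append one predicted token at a time until EOS or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == EOS:
            break
        tokens.append(next_token)  # the prediction becomes part of the next input
    return tokens


completion = generate(["the", "cat"])
```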
Masked self-attention
Masked self-attention ensures that each position can only attend to earlier positions in the output sequence, preserving the autoregressive property. This is crucial for preventing the decoder from having access to future tokens in the output sequence.
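One common way to implement this is a causal mask: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. The sketch below is a minimal illustration; the score matrix is made of toy values.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    shift = max(masked)  # finite, because each position can always attend to itself
    exps = [math.exp(s - shift) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.5, 2.0, 1.0], [1.0, 1.0, 3.0], [0.2, 0.4, 0.6]]  # toy attention scores
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first position can only attend to itself, the second splits its attention between the first two positions, and only the last position sees the whole sequence.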
Encoder-decoder cross-attention
The encoder-decoder cross-attention mechanism allows the decoder to focus on relevant parts of the input sequence, utilizing the contextual embeddings generated by the encoder. This iterative process continues until the decoder predicts an end-of-sequence token, thereby completing the output sequence generation.
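The difference from self-attention is where the inputs come from: queries are taken from the decoder, while keys and values are taken from the encoder output Z. The sketch below is a simplified single-head version without learned projections, which is an assumption made to keep it short; all vectors are toy values.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder output."""
    d = len(encoder_outputs[0])
    results = []
    for query in decoder_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in encoder_outputs
        ]
        weights = softmax(scores)  # relevance of each source token to this target position
        results.append(
            [sum(w * z[j] for w, z in zip(weights, encoder_outputs)) for j in range(d)]
        )
    return results


Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # four encoded source tokens
decoder_states = [[0.0, 2.0], [2.0, 0.0]]              # two target positions so far
context = cross_attention(decoder_states, Z)
```

The result has one context vector per decoder position, regardless of how many source tokens the encoder produced.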
Context length
The context length refers to the number of previous tokens the model can ‘remember’ and use to predict the next token in the sequence.
Longer context lengths allow the model to capture more complex relationships and dependencies within the text, potentially leading to better performance. However, longer contexts also require more computational resources and memory, which can slow down training and inference. Choosing an appropriate context length involves balancing these trade-offs based on the specific task and available resources.
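A rough sketch of both sides of this trade-off follows. Keeping only the most recent tokens is one simple way to fit a long history into a fixed window (other strategies exist), and the cost function reflects the fact that standard self-attention compares every token with every other token. The function names and the window sizes are assumptions chosen for illustration.

```python
def truncate_to_context(tokens, context_length):
    """Keep only the most recent tokens that fit in the context window."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return tokens[-context_length:]


def attention_score_count(sequence_length):
    """Number of pairwise attention scores in standard self-attention."""
    return sequence_length * sequence_length


history = list(range(10))                  # ten toy token IDs
visible = truncate_to_context(history, 4)  # the model only sees the last four

short_cost = attention_score_count(1024)
long_cost = attention_score_count(4096)
ratio = long_cost / short_cost  # a 4x longer context needs 16x more attention scores
```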
Popular models
| Model | Developer | Release Date | Architecture | Key Features | Key Differences |
|---|---|---|---|---|---|
| GPT-2 | OpenAI | February 2019 | Decoder-only Transformer | 1.5B parameters; first widely available large language model; generated realistic text | Popularized large-scale text generation; early model with few-shot capabilities |
| GPT-3 | OpenAI | June 2020 | Decoder-only Transformer | 175B parameters; few-shot learning; API access enabled a wide range of applications | High parameter count; versatile for tasks from summarization to creative text generation |
| GPT-4 | OpenAI | March 2023 | Decoder-only Transformer | Multimodal (text and image); improved reasoning capabilities | Enhanced multimodality and interpretive abilities, better handling of complex queries |
| LaMDA | Google | May 2021 | Decoder-only Transformer | 137B parameters; optimized for dialogue; trained on conversational data | Specializes in open-ended conversation and maintaining coherent, contextual responses |
| Chinchilla | DeepMind | March 2022 | Decoder-only Transformer | Optimized for efficiency; focuses on smaller parameters with more training tokens (70B params) | Demonstrates that model performance can improve with optimal parameter-to-data scaling |
| PaLM | Google | April 2022 | Decoder-only Transformer | 540B parameters; extensive multilingual support; high reasoning capabilities | One of the largest language models; designed for tasks requiring large context and deep reasoning |
| PaLM 2 | Google | May 2023 | Decoder-only Transformer | Smaller models (4 sizes); multilingual, coding, and medical versions | More efficient than PaLM; specialized variants for domain-specific applications |
| Gopher | DeepMind | December 2021 | Decoder-only Transformer | 280B parameters; designed for a wide range of NLP tasks | Extensive evaluation on knowledge, reasoning, and reading comprehension benchmarks |
| GLaM | Google | December 2021 | Mixture of Experts | 1.2T total parameters, but only a subset active per input; efficient and scalable | Mixture-of-experts approach enables high capacity with reduced computation per input |
| Mistral | Mistral AI | September 2023 | Decoder-only Transformer | 7B parameters; fully open-source; optimized for inference efficiency | Smaller size with performance optimized for real-world deployment and efficiency |
| Gwen | Huawei | April 2021 | Encoder-only Transformer | 10B parameters; optimized for scientific and technical text | Focuses on technical document understanding and domain-specific NLP applications |
| LLaMA | Meta | February 2023 | Decoder-only Transformer | Open to academic researchers; efficient performance on smaller hardware | Designed for high performance with fewer resources, targeted at research |
| LLaMA 2 | Meta | July 2023 | Decoder-only Transformer | 7B, 13B, 70B parameters; fully open-source; includes fine-tuning capabilities | Open-source; optimized for fine-tuning and generalization across applications |
| Gemini 1 | Google DeepMind | December 2023 | Multimodal Transformer | Advanced language and visual reasoning capabilities, integrating DeepMind’s reinforcement learning expertise | Combines reinforcement learning expertise with language and vision for complex reasoning |
- OpenAI: GPT models evolved from GPT-2’s 1.5B parameters to GPT-4’s multimodal capabilities. GPT-3 and GPT-4 set benchmarks in few-shot learning and text generation.
- Google (LaMDA, PaLM, GLaM): Models span conversational dialogue (LaMDA), large-scale multilingual capabilities (PaLM), and efficiency-focused architectures (GLaM’s mixture of experts). PaLM 2 introduced smaller, domain-optimized variants.
- DeepMind (Chinchilla, Gopher): Gopher and Chinchilla focus on optimizing model scaling. Chinchilla’s high data-to-parameter ratio shows that well-balanced scaling can improve performance even with fewer parameters.
- Meta (LLaMA): The LLaMA series is known for efficiency, small hardware requirements, and open access for research. LLaMA 2 made its weights openly available for broader applications, including commercial use.
- Mistral: Known for releasing an efficient, open-source 7B parameter model, designed to be highly effective with lower computational demands.
The illustrated transformer
The Illustrated Transformer by Jay Alammar (from his blog, Visualizing machine learning one concept at a time) is an excellent visual guide for understanding the architecture in more depth.