The transformer architecture was developed at Google in 2017 for use in a translation model.

It is a sequence-to-sequence model that converts sequences from one domain into sequences in another, for example translating French sentences into English sentences. The original transformer architecture consists of two parts: an encoder and a decoder. The encoder converts the input text (e.g., a French sentence) into a representation, which is then passed to the decoder. The decoder uses this representation to generate the output text (e.g., an English translation) autoregressively, that is, one token at a time.

Notably, the encoder produces one output vector per input token, so the size of its output is linear in the size of its input.

The transformer consists of multiple layers. A layer in a neural network comprises a set of parameters that perform a specific transformation on the data.

Some of the layers used in a transformer architecture are the following:

  • Multi-Head Attention
  • Add & Norm
  • Feed-Forward
  • Linear
  • Softmax

The layers can be sub-divided into the input, hidden and output layers. The input layer (e.g., Input/Output Embedding) is the layer where the raw data enters the network. Input embeddings are used to represent the input tokens to the model. Output embeddings are used to represent the output tokens that the model predicts.

Embeddings

To prepare language inputs for transformers, we convert an input sequence into tokens and then into input embeddings. At a high level, an input embedding is a high-dimensional vector that represents the meaning of each token in the sentence. This embedding is then fed into the transformer for processing.

Generating an input embedding involves the following steps:

  1. Normalization (Optional): Standardizes text by removing redundant whitespace, accents, etc.
  2. Tokenization: Breaks the sentence into words or subwords and maps them to integer token IDs from a vocabulary.
  3. Embedding: Converts each token ID to its corresponding high-dimensional vector, typically using a lookup table. These can be learned during the training process.
  4. Positional Encoding: Adds information about the position of each token in the sequence to help the transformer understand word order.

These steps help to prepare the input for the transformers so that they can better understand the meaning of the text.
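The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the vocabulary, the random lookup table, whitespace tokenization, lowercasing during normalization, and the sinusoidal form of the positional encoding are all choices made here for the example, not something the text above prescribes. Real systems use subword tokenizers and learned embedding tables.

```python
import math
import random
import unicodedata


def normalize(text):
    """Strip accents, lowercase (an assumption), and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def tokenize(text, vocab):
    """Whitespace tokenization; unknown words map to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def embed(token_ids, table):
    """Look up each token's vector and add its positional encoding."""
    d_model = len(table[0])
    out = []
    for pos, tid in enumerate(token_ids):
        pe = positional_encoding(pos, d_model)
        out.append([e + p for e, p in zip(table[tid], pe)])
    return out


# Hypothetical toy vocabulary and randomly initialised lookup table.
vocab = {"<unk>": 0, "le": 1, "chat": 2, "dort": 3}
rng = random.Random(0)
d_model = 8
table = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

token_ids = tokenize(normalize("  Le   chat dört "), vocab)
embeddings = embed(token_ids, table)  # one d_model-sized vector per token
```

Each token ends up as one vector of length `d_model`, which is what the first transformer layer receives.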

Layers in a transformer

Each layer in a transformer is composed of the following components:

Layer normalization (Add and Norm)

Each layer in a transformer consists of a multi-head attention module and a feed-forward layer, and employs layer normalization and residual connections around both. The Add and Norm layer is applied after both the multi-head attention module and the feed-forward layer.

Layer normalization

Layer normalization computes the mean and variance of the activations in a given layer and uses them to normalize those activations. This is typically done to reduce covariate shift and to improve gradient flow, which yields faster convergence during training and better overall performance.
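As a minimal sketch of the computation just described, the function below normalizes a single activation vector. The names `gamma`, `beta`, and `eps` are assumptions introduced here: they stand for the learnable scale, the learnable shift, and a small constant that avoids division by zero.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one activation vector to zero mean and (almost) unit variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    # Optional learnable scale and shift; identity when not provided.
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(activations)
```

The statistics are computed per token over its feature dimension, so the result does not depend on other tokens or on the batch.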

Residual connections

Residual connections propagate the inputs to the output of one or more layers. This makes the network easier to optimize and also helps mitigate vanishing and exploding gradients.

Residual connection

Residual in this context means the “leftover” information that is not fully captured by the layer and its output. It is a connection because it bypasses the layer function and is added to its output.
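A rough sketch of how the residual connection and layer normalization combine into one Add and Norm step is shown below. It assumes the post-norm arrangement, `LayerNorm(x + sublayer(x))`, and uses a hypothetical `toy_sublayer` as a stand-in for attention or the feed-forward layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    # "Add": the input bypasses the sublayer and is added to its output.
    residual = [xi + si for xi, si in zip(x, sublayer(x))]
    # "Norm": normalize the sum.
    return layer_norm(residual)


def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward."""
    return [0.1 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, toy_sublayer)
```

If the sublayer outputs zeros, the block reduces to normalizing the input itself, which is one way to see why the skip path keeps information and gradients flowing.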

Feedforward layer

Feedforward layers are common building blocks of deep neural networks:

  • The input is multiplied by weights, and biases are added
  • A non-linear activation function such as ReLU or GELU is applied to the result

Tip

A non-linear activation is required: if we used only linear functions, even a deep network would collapse into a single linear transformation that can capture only linear relationships.
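To see the collapse concretely, consider two stacked linear layers with no activation between them. The symbols $W_1, b_1, W_2, b_2$ are notation introduced here for illustration:

```latex
\begin{aligned}
y &= W_2 (W_1 x + b_1) + b_2 \\
  &= (W_2 W_1)\, x + (W_2 b_1 + b_2) \\
  &= W' x + b', \qquad W' = W_2 W_1,\quad b' = W_2 b_1 + b_2
\end{aligned}
```

The two layers are equivalent to one linear layer with weights $W'$ and bias $b'$, so stacking them adds no expressive power.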

Because the feedforward layer in a transformer is applied independently to each token, it can enrich each token’s representation after attention by adding capacity and non-linearity.
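The sketch below illustrates a position-wise feedforward layer as two linear maps with a ReLU in between, applied to each token separately. The weights `w1`, `b1`, `w2`, `b2` are hypothetical toy values chosen so the arithmetic is easy to follow.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def linear(x, weights, bias):
    """Compute W x + b, with weights stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def feed_forward(tokens, w1, b1, w2, b2):
    """Apply the same two-layer network to every token independently."""
    return [linear(relu(linear(t, w1, b1)), w2, b2) for t in tokens]


# Hypothetical toy weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5]]
outputs = feed_forward(tokens, w1, b1, w2, b2)
# outputs == [[3.0, -1.0], [1.0, -0.5]]
```

Because no token looks at any other token here, mixing information across positions is left entirely to the attention module.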

Encoder and decoder

The original transformer architecture relies on a combination of encoder and decoder modules. Each encoder and decoder consists of a series of layers, with each layer comprising key components:

  • a multi-head self-attention mechanism
  • a position-wise feed-forward network
  • normalization layers
  • residual connections.

Encoder

The encoder’s primary function is to process the input sequence into a continuous representation that holds contextual information for each token. The input sequence is first normalized, tokenized, and converted into embeddings. Positional encodings are added to these embeddings to retain sequence order information. Through self-attention mechanisms, each token in the sequence can dynamically attend to any other token, thus understanding the contextual relationships within the sequence. The output from the encoder is a series of embedding vectors Z representing the entire input sequence.
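The self-attention step can be sketched as scaled dot-product attention. To keep the example short, it assumes a single head and uses the input vectors directly as queries, keys, and values. A real encoder applies learned projection matrices and several heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    d = len(x[0])
    output = []
    for query in x:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in x]
        weights = softmax(scores)
        # Weighted average of all value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, x))
                       for j in range(d)])
    return output


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
z = self_attention(x)                     # one contextual vector per token
```

The output `z` has the same number of vectors as the input, matching the series of embedding vectors Z described above.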

Decoder

The decoder is tasked with generating an output sequence based on the context provided by the encoder’s output Z. It operates in a token-by-token fashion, beginning with a start-of-sequence token. The decoder layers employ two types of attention mechanisms:

  • masked self-attention
  • encoder-decoder cross-attention.

The majority of recent LLMs have adopted a decoder-only variant of the transformer architecture. This approach forgoes the traditional encoder-decoder separation, focusing instead on directly generating the output sequence from the input. Effectively, the specific tasks for encoding and decoding are merged:

  • The input sequence undergoes a similar process of embedding and positional encoding before being fed into the decoder.
  • The decoder then uses masked self-attention to generate predictions for each subsequent token based on the previously generated tokens.

Masked self-attention

Masked self-attention ensures that each position can only attend to earlier positions in the output sequence, preserving the autoregressive property. This is crucial for preventing the decoder from having access to future tokens in the output sequence.
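One common way to implement this mask, sketched here under the assumption of a single head with no learned projections, is to set the scores of future positions to negative infinity before the softmax so that their weights become exactly zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf scores receive zero weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(x):
    """Attention weights where position i can only see positions 0..i."""
    d = len(x[0])
    all_weights = []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(d))
        all_weights.append(softmax(scores))
    return all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
weights = masked_attention_weights(x)
# weights[0] == [1.0, 0.0, 0.0]: the first token can only attend to itself
```

The resulting weight matrix is lower triangular, which is what preserves the autoregressive property.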

Encoder-decoder cross-attention

The encoder-decoder cross-attention mechanism allows the decoder to focus on relevant parts of the input sequence, utilizing the contextual embeddings generated by the encoder. This iterative process continues until the decoder predicts an end-of-sequence token, thereby completing the output sequence generation.
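The iterative generation process can be sketched as a simple loop. The `next_token` callable is a hypothetical stand-in for a full decoder, which would run masked self-attention and cross-attention internally. The token IDs and the `max_len` safety limit are likewise assumptions made for the example.

```python
def generate(next_token, start_id, end_id, max_len=20):
    """Generate tokens one at a time until end_id or max_len is reached."""
    output = [start_id]
    while len(output) < max_len:
        # The decoder only ever sees the tokens generated so far.
        token = next_token(output)
        output.append(token)
        if token == end_id:
            break
    return output


# Hypothetical "decoder" that replays a fixed target sequence.
# Assumed IDs: 1 = start-of-sequence, 2 = end-of-sequence.
TARGET = [1, 7, 8, 9, 2]


def toy_next_token(prefix):
    return TARGET[len(prefix)]


generated = generate(toy_next_token, start_id=1, end_id=2)
# generated == [1, 7, 8, 9, 2]
```

In a real model, `next_token` would pick a token from the probability distribution produced by the final Linear and Softmax layers.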

Context length

The context length refers to the number of previous tokens the model can ‘remember’ and use to predict the next token in the sequence.

Longer context lengths allow the model to capture more complex relationships and dependencies within the text, potentially leading to better performance. However, longer contexts also require more computational resources and memory, which can slow down training and inference. Choosing an appropriate context length involves balancing these trade-offs based on the specific task and available resources.
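The trade-off can be made concrete with a small sketch. It assumes standard full self-attention, where every token is compared with every other token, so the number of attention scores grows with the square of the context length. The function names are illustrative only.

```python
def truncate_context(token_ids, context_length):
    """Keep only the most recent tokens that fit in the context window."""
    start = max(0, len(token_ids) - context_length)
    return token_ids[start:]


def attention_score_count(context_length):
    """Full self-attention computes one score per pair of tokens: n * n."""
    return context_length * context_length


history = list(range(10))              # ten token IDs, oldest first
window = truncate_context(history, 4)  # [6, 7, 8, 9]

# Doubling the context length quadruples the number of attention scores.
ratio = attention_score_count(4096) / attention_score_count(2048)  # 4.0
```

Tokens that fall outside the window are simply not visible to the model when it predicts the next token.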

Popular models

| Model | Developer | Release Date | Architecture | Key Features | Key Differences |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | OpenAI | February 2019 | Decoder-only Transformer | 1.5B parameters; first widely available large language model; generated realistic text | Popularized large-scale text generation; early model with few-shot capabilities |
| GPT-3 | OpenAI | June 2020 | Decoder-only Transformer | 175B parameters; few-shot learning; API access enabled a wide range of applications | High parameter count; versatile for tasks from summarization to creative text generation |
| GPT-4 | OpenAI | March 2023 | Decoder-only Transformer | Multimodal (text and image); improved reasoning capabilities | Enhanced multimodality and interpretive abilities, better handling of complex queries |
| LaMDA | Google | May 2021 | Decoder-only Transformer | 137B parameters; optimized for dialogue; trained on conversational data | Specializes in open-ended conversation and maintaining coherent, contextual responses |
| Chinchilla | DeepMind | March 2022 | Decoder-only Transformer | Optimized for efficiency; focuses on smaller parameters with more training tokens (70B params) | Demonstrates that model performance can improve with optimal parameter-to-data scaling |
| PaLM | Google | April 2022 | Decoder-only Transformer | 540B parameters; extensive multilingual support; high reasoning capabilities | One of the largest language models; designed for tasks requiring large context and deep reasoning |
| PaLM 2 | Google | May 2023 | Decoder-only Transformer | Smaller models (4 sizes); multilingual, coding, and medical versions | More efficient than PaLM; specialized variants for domain-specific applications |
| Gopher | DeepMind | December 2021 | Decoder-only Transformer | 280B parameters; designed for a wide range of NLP tasks | Extensive evaluation on knowledge, reasoning, and reading comprehension benchmarks |
| GLaM | Google | December 2021 | Mixture of Experts | 1.2T total parameters, but only a subset active per input; efficient and scalable | Mixture-of-experts approach enables high capacity with reduced computation per input |
| Mistral | Mistral AI | September 2023 | Decoder-only Transformer | 7B parameters; fully open-source; optimized for inference efficiency | Smaller size with performance optimized for real-world deployment and efficiency |
| Gwen | Huawei | April 2021 | Encoder-only Transformer | 10B parameters; optimized for scientific and technical text | Focuses on technical document understanding and domain-specific NLP applications |
| LLaMA | Meta | February 2023 | Decoder-only Transformer | Open to academic researchers; efficient performance on smaller hardware | Designed for high performance with fewer resources, targeted at research |
| LLaMA 2 | Meta | July 2023 | Decoder-only Transformer | 7B, 13B, 70B parameters; fully open-source; includes fine-tuning capabilities | Open-source; optimized for fine-tuning and generalization across applications |
| Gemini 1 | Google DeepMind | December 2023 | Multimodal Transformer | Advanced language and visual reasoning capabilities, integrating DeepMind’s reinforcement learning expertise | Combines reinforcement learning expertise with language and vision for complex reasoning |

  • OpenAI: GPT models evolved from GPT-2’s 1.5B parameters to GPT-4’s multimodal capabilities. GPT-3 and GPT-4 set benchmarks in few-shot learning and text generation.

  • Google (LaMDA, PaLM, GLaM): Models span conversational dialogue (LaMDA), large-scale multilingual capabilities (PaLM), and efficiency-focused architectures (GLaM’s mixture of experts). PaLM 2 introduced smaller, domain-optimized variants.

  • DeepMind (Chinchilla, Gopher): Gopher and Chinchilla focus on optimizing model scaling. Chinchilla’s high data-to-parameter ratio emphasizes that well-balanced scaling can improve performance even with fewer parameters.

  • Meta (LLaMA): The LLaMA series is known for efficiency, modest hardware requirements, and open access for research. LLaMA 2 offered full open-source access for broader applications.

  • Mistral: Known for releasing an efficient, open-source 7B parameter model, designed to be highly effective with lower computational demands.

The illustrated transformer

For a more detailed, visual walkthrough of the architecture, an excellent guide is The Illustrated Transformer by Jay Alammar.