A typical transformer training loop consists of several parts:

  1. First, batches of input sequences are sampled from a training dataset.
  2. For each input sequence, there is a corresponding target sequence. (In unsupervised pre-training, the target sequence is derived from the input sequence itself.)
  3. The batch of input sequences is then fed into the transformer.
  4. The transformer generates predicted output sequences.
  5. The difference between the predicted and target sequences is measured using a loss function (often cross-entropy loss).
  6. Gradients of this loss are calculated, and an optimizer uses them to update the transformer’s parameters.

This process is repeated until:

  • the transformer converges to a certain level of performance or
  • it has been trained on a pre-specified number of tokens.
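
The six steps and the token-budget stopping rule above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the "model" is a bigram logit table standing in for a transformer, the optimizer is plain gradient descent, and the names train, token_budget, and the toy dataset are all hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(dataset, vocab_size, token_budget, batch_size=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Stand-in "model": weights[prev][next] holds a logit. NOT a transformer.
    weights = [[0.0] * vocab_size for _ in range(vocab_size)]
    tokens_seen, losses = 0, []
    while tokens_seen < token_budget:  # stop after a pre-specified number of tokens
        # Step 1: sample a batch of input sequences.
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        total_loss, count = 0.0, 0
        for seq in batch:
            # Step 2: derive the target sequence from the input itself.
            inputs, targets = seq[:-1], seq[1:]
            for prev, target in zip(inputs, targets):
                # Steps 3-4: forward pass produces a predicted distribution.
                probs = softmax(weights[prev])
                # Step 5: cross-entropy loss for this position.
                total_loss -= math.log(probs[target])
                # Step 6a: gradient of the loss w.r.t. the logits is probs - one_hot.
                for k in range(vocab_size):
                    grads[prev][k] += probs[k] - (1.0 if k == target else 0.0)
                count += 1
        # Step 6b: gradient-descent update on the batch-averaged gradient.
        for i in range(vocab_size):
            for k in range(vocab_size):
                weights[i][k] -= lr * grads[i][k] / count
        tokens_seen += count
        losses.append(total_loss / count)
    return weights, losses

# Toy dataset: token ids cycle 0 -> 1 -> 2 -> 3 -> 0.
dataset = [[0, 1, 2, 3, 0, 1, 2, 3], [1, 2, 3, 0, 1, 2, 3, 0]]
weights, losses = train(dataset, vocab_size=4, token_budget=2000)
```

A real setup would replace the logit table with a transformer, compute gradients by automatic differentiation, and typically use an adaptive optimizer, but the control flow stays the same.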

There are different approaches to formulating the training task for transformers, depending on the architecture used:

  • Decoder-only models are typically pre-trained on the language modeling task using causal language modeling (next-token prediction).
  • Encoder-only models (like BERT) are often pre-trained by corrupting the input sequence in some way and having the model try to reconstruct it. One such approach is masked language modeling (MLM).
  • Encoder-decoder models (like the original transformer) are trained on sequence-to-sequence supervised tasks such as translation, question-answering, and summarization. These models could also be trained in an unsupervised way by converting other tasks into sequence-to-sequence format.
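
To make the difference between the first two formulations concrete, the sketch below builds (input, target) pairs from the same token sequence for causal language modeling and for a simplified form of masked language modeling. The constants MASK_ID and IGNORE, the function names, and the token ids are hypothetical. The masking is deliberately simplified: BERT's published recipe also sometimes leaves a selected token unchanged or swaps in a random token, which is omitted here.

```python
import random

MASK_ID = 0    # hypothetical id reserved for the [MASK] token
IGNORE = -100  # hypothetical label meaning "no loss at this position"

def causal_lm_pair(tokens):
    """Next-token prediction: position t is trained to predict token t + 1."""
    return tokens[:-1], tokens[1:]

def mlm_pair(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM: hide some tokens and ask the model to recover them."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # corrupt the input at this position
            labels.append(tok)      # the original token is the target
        else:
            inputs.append(tok)      # leave the input untouched
            labels.append(IGNORE)   # no loss is computed here
    return inputs, labels

tokens = [11, 12, 13, 14, 15, 16]
clm_inputs, clm_targets = causal_lm_pair(tokens)
mlm_inputs, mlm_labels = mlm_pair(tokens, mask_prob=0.5)
```

In the causal case every position contributes to the loss, whereas in the masked case only the corrupted positions do.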

Pre-training

The first stage of training a transformer, often referred to as pre-training, is the foundational stage in which an LLM is trained on large, diverse, unlabelled text datasets and tasked with predicting the next token given the previous context. The goal of this stage is to leverage a large, general distribution of data and to create a model that is good at sampling from this general distribution.
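
The next-token objective described here is commonly written as a negative log-likelihood over a sequence. The notation is introduced only for illustration: x_1, ..., x_T is a training sequence of T tokens, and p_θ is the distribution predicted by a model with parameters θ.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing this quantity, averaged over the training data, is the same as minimizing the cross-entropy between the predicted distribution and the observed next tokens.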

After language model pre-training, the resulting LLM usually demonstrates a reasonable level of language understanding and language generation skills across a variety of different tasks, which are typically tested through zero-shot or few-shot prompting. Pre-training is the most expensive stage in terms of time (from weeks to months, depending on the size of the model) and the amount of required computational resources (GPU/TPU hours).
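
As a rough illustration of the two prompting styles, the sketch below assembles a zero-shot prompt (task description only) and a few-shot prompt (task description plus worked examples). The build_prompt helper, the "Input:"/"Output:" layout, and the sentiment task are assumptions made for this example, not a prescribed format.

```python
def build_prompt(task, query, examples=()):
    """Assemble a prompt; with no examples it is zero-shot, otherwise few-shot."""
    lines = [task]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

task = "Classify the sentiment as positive or negative."
query = "The plot was dull."

zero_shot = build_prompt(task, query)
few_shot = build_prompt(
    task,
    query,
    examples=[("I loved it.", "positive"), ("Waste of time.", "negative")],
)
```

In both cases the pre-trained model's parameters are left unchanged; only the text it is asked to continue differs.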

The following step is fine-tuning.