A language model predicts the probability of a sequence of words. Commonly, when given a prefix of text, a language model assigns probabilities to the words that could follow. For example, given the prefix “The most famous city in the US is…”, a language model might assign high probabilities to “New York” and “Los Angeles” and low probabilities to “laptop” or “apple”.
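One common way to write this down is the chain rule of probability, which factors the probability of a whole sequence into next-word predictions. The symbols here are notation chosen for illustration and do not come from the text above: $w_i$ is the $i$-th word and $n$ is the sequence length.

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$

Each factor on the right is the probability a language model assigns to the next word given the prefix before it.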
You can create a basic language model by storing an n-gram table, while modern language models are often based on neural networks, such as transformers. Before the invention of transformers, recurrent neural networks (RNNs) were the popular approach for modeling sequences. In particular, long short-term memory (LSTM) and gated recurrent unit (GRU) networks were common architectures.
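As a rough illustration of the n-gram table idea, the sketch below builds a bigram (n = 2) model from a toy corpus. The function names `train_bigram` and `next_word_probs` and the corpus itself are hypothetical, chosen only for this example; a practical n-gram model would also need smoothing and a much larger corpus.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw counts for one prefix word into probabilities."""
    following = counts.get(prev)
    if not following:
        return {}  # prefix never seen: no estimate without smoothing
    total = sum(following.values())
    return {word: count / total for word, count in following.items()}


# Toy corpus, whitespace-tokenized (an assumption made for brevity).
corpus = "the cat sat on the mat the cat ate".split()
table = train_bigram(corpus)
probs = next_word_probs(table, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

The table only conditions on the single previous word, which is why such models struggle with longer context.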
The sequential nature of RNNs makes them compute-intensive and hard to parallelize during training, although recent work on state space models attempts to overcome these challenges. Transformers, by contrast, can process tokens in parallel thanks to the self-attention mechanism, which means they:
- can better model long-term context
- are easier to parallelize
These differences make transformers significantly faster to train and more powerful than RNNs at handling long-term dependencies in long-sequence tasks.
Warning
The cost of self-attention in the original transformer is quadratic in the context length, which limits the size of the context, while RNNs have a theoretically infinite context length.
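To make the quadratic cost concrete, here is a minimal pure-Python sketch of single-head scaled dot-product self-attention that also counts how many attention scores it computes. It is a simplification under stated assumptions: the learned query, key, and value projections are omitted (each input vector serves as all three), and the function name `self_attention` and the toy inputs are hypothetical.

```python
import math


def self_attention(x):
    """Simplified self-attention: every token attends to every token.

    x is a list of n token vectors, each of dimension d.
    Returns the attended vectors and the number of scores computed.
    """
    n, d = len(x), len(x[0])
    outputs = []
    score_count = 0
    for query in x:
        # One scaled dot-product score per (query, key) pair: n scores per query.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        score_count += len(scores)
        # Numerically stable softmax over the n scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        # Weighted sum of the value vectors (here, the inputs themselves).
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)]
        )
    return outputs, score_count


_, small = self_attention([[0.1 * i, 1.0] for i in range(4)])
_, large = self_attention([[0.1 * i, 1.0] for i in range(8)])
print(small, large)  # 16 64: doubling the length quadruples the scores
```

Because every token is scored against every other token, the number of scores grows as the square of the sequence length.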