Data Behind Large Language Models (LLMs), GPT, and Beyond
Decoding the Data that Powers Large Language Models (LLMs) and GPT, and Looking Ahead to Their Future

A Brief History of Large Language Models
Large language models are computer programs that can generate natural language text based on a given input. Language models date back to Claude Shannon, who founded information theory in 1948 with his seminal paper, A Mathematical Theory of Communication. The first large language models, such as Google’s N-gram model and Microsoft’s Web N-gram model, were developed in the late 2000s and early 2010s.
An n-gram is a sequence of n successive items in a text document; the items may be words, numbers, symbols, or punctuation.
An n-gram language model assigns a probability to the next word given the preceding words. These probabilities are computed from the number of times various n-grams (e.g., ate the mouse, ate the cheese) occur in a large corpus of text, and are appropriately smoothed to avoid overfitting.
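To make the counting concrete, here is a minimal sketch (not from the original article) of a trigram model with add-one (Laplace) smoothing; the class and function names are illustrative, and a real system would use far more sophisticated smoothing (e.g., Kneser-Ney).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every window of n successive tokens in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NGramLM:
    """Count-based n-gram language model with add-one (Laplace) smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-gram contexts
        self.vocab = set()

    def fit(self, corpus):
        """corpus: an iterable of token lists, e.g. [['the', 'cat', 'ate', 'the', 'mouse'], ...]"""
        for tokens in corpus:
            self.vocab.update(tokens)
            for gram in ngrams(tokens, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def prob(self, context, word):
        """P(word | context), smoothed so unseen n-grams get nonzero probability."""
        context = tuple(context[-(self.n - 1):])
        numerator = self.ngram_counts[context + (word,)] + 1
        denominator = self.context_counts[context] + len(self.vocab)
        return numerator / denominator

# Toy corpus: compare P(cheese | ate the) vs P(mouse | ate the).
lm = NGramLM(n=3)
lm.fit([["the", "cat", "ate", "the", "mouse"],
        ["the", "cat", "ate", "the", "cheese"],
        ["the", "dog", "ate", "the", "cheese"]])
print(lm.prob(["ate", "the"], "cheese"))  # seen twice -> higher probability
print(lm.prob(["ate", "the"], "mouse"))   # seen once -> lower probability
```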
N-gram models are extremely computationally efficient but statistically inefficient. Neural-network language models, on the other hand, are statistically efficient but computationally inefficient.
Building on this idea, key developments in neural language modeling since 2003 include:
- Recurrent Neural Networks (RNNs) including Long Short-Term Memory (LSTMs)
- BERT (Bidirectional Encoder Representations from Transformers)
- T5 (Text-to-Text Transfer Transformer)
With the rise of deep learning in the 2010s and major hardware advances (e.g., GPUs), the size of neural language models has skyrocketed. The following table shows that model sizes have increased by a factor of roughly five thousand over just the last four years:

