Data Behind Large Language Models (LLMs), GPT, and Beyond
Decoding the Data that Powers Large Language Models (LLMs) and GPT, and Looking Ahead to Their Future
A Brief History of Large Language Models
Large language models are computer programs that generate natural language text from a given input. Language models themselves date back to Claude Shannon, who founded information theory in 1948 with his seminal paper, "A Mathematical Theory of Communication." The first large-scale language models, such as Google's N-gram model and Microsoft's Web N-gram model, appeared in the late 2000s and early 2010s.
An n-gram is a sequence of n consecutive items in a text document, which may be words, numbers, symbols, or punctuation.
An n-gram language model predicts each word from the n-1 words that precede it. These probabilities are estimated from how often the various n-grams (e.g., "ate the mouse," "ate the cheese") occur in a large corpus of text, and are then appropriately smoothed so that unseen n-grams do not receive zero probability and the model does not overfit the observed counts.
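To make this concrete, here is a minimal sketch in Python of a trigram model with add-one (Laplace) smoothing; the toy corpus and variable names are illustrative assumptions, not drawn from any real dataset or library.

```python
from collections import defaultdict

# Toy corpus (illustrative only). Real n-gram models are trained on billions of tokens.
corpus = "the mouse ate the cheese and the cat ate the mouse".split()

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
vocab = set(corpus)

# Count each trigram and its two-word context.
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2, w3)] += 1
    bigram_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2) with add-one smoothing, so unseen trigrams get a small nonzero probability."""
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + len(vocab))

print(trigram_prob("ate", "the", "mouse"))   # seen in the corpus -> relatively high
print(trigram_prob("ate", "the", "cheese"))  # also seen
print(trigram_prob("ate", "the", "cat"))     # unseen trigram -> small but nonzero
```

Add-one smoothing is the simplest choice; production n-gram systems typically use more sophisticated schemes such as Kneser-Ney smoothing, but the counting-and-normalizing idea is the same.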
N-gram models are extremely computationally efficient but statistically inefficient: they need enormous amounts of data to cover longer contexts and cannot generalize to word sequences they have never seen. Neural-network language models, on the other hand, are statistically efficient but computationally expensive.