Word n-gram language model

A word n-gram language model is a statistical model of language which calculates the probability of the next word in a sequence from a fixed size window of previous words. If one previous word is considered, it is a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.

Special tokens are introduced to denote the start and end of a sentence $\langle s\rangle$ and $\langle /s\rangle$ . To prevent a zero probability being assigned to unseen words, the probability of each seen word is slightly lowered to make room for the unseen words in a given corpus. To achieve this, various smoothing methods are used, from simple "add-one" smoothing (assigning a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated techniques, such as Good–Turing discounting or back-off models.

Word n-gram models have largely been superseded by recurrent neural network–based models, which in turn have been superseded by Transformer-based models often referred to as large language models.