Alphabet (formal languages)

In formal language theory, an alphabet, often called a vocabulary in the context of terminal and nonterminal symbols, is a non-empty set of indivisible symbols (also called characters or glyphs), typically thought of as representing letters, characters, digits, phonemes, or even words. The definition is used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite (e.g., the alphabet of letters "a" through "z"), countably infinite (e.g., $\{v_{1},v_{2},\ldots \}$ ), or even uncountable (e.g., $\{v_{x}:x\in \mathbb {R} \}$ ).

A string, also known as a "word" or "sentence", over an alphabet is a finite sequence of symbols from that alphabet. For example, the alphabet of lowercase letters "a" through "z" can be used to form English words like "iceberg", while the alphabet of both uppercase and lowercase letters can also be used to form proper names like "Wikipedia". A common alphabet is {0,1}, the binary alphabet, and "00101111" is an example of a binary string. Infinite sequences of symbols may be considered as well (see Omega language).
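The membership condition behind "a string over an alphabet" can be illustrated with a minimal Python sketch. It assumes every symbol is a single character, and the names `is_string_over`, `LOWERCASE`, `LETTERS`, and `BINARY` are hypothetical, chosen only for this illustration.

```python
import string

# Alphabets modeled as sets of single-character symbols (an assumption of this sketch).
LOWERCASE = set(string.ascii_lowercase)
LETTERS = set(string.ascii_letters)  # uppercase and lowercase letters
BINARY = {"0", "1"}


def is_string_over(word, alphabet):
    """Return True if every symbol of `word` belongs to `alphabet`."""
    return all(symbol in alphabet for symbol in word)


print(is_string_over("iceberg", LOWERCASE))    # True
print(is_string_over("Wikipedia", LOWERCASE))  # False: "W" is not a lowercase letter
print(is_string_over("Wikipedia", LETTERS))    # True
print(is_string_over("00101111", BINARY))      # True
```

Under this check the empty sequence counts as a string over every alphabet, since it contains no symbol that could fall outside the set.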

Strings are often written as the concatenation of their symbols, and when using this notational convention it is convenient for practical purposes to restrict the symbols in an alphabet so that the notation is unambiguous. For instance, if the two-member alphabet is {00,0}, a string written in concatenated form as "000" is ambiguous because it is unclear whether it is a sequence of three "0" symbols, a "00" followed by a "0", or a "0" followed by a "00". However, this is a limitation of the notation for writing strings, not of their underlying definition. Like any non-empty finite set, {00,0} can be used as an alphabet, whose strings can be written unambiguously in a different notational convention with commas separating their elements: 0,00 ≠ 0,0,0 ≠ 00,0.
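The ambiguity can be made concrete by enumerating every way a concatenated text splits into symbols of a given alphabet. The following Python sketch does this by simple recursion; the function name `parses` is hypothetical, and symbols are assumed to be non-empty Python strings.

```python
def parses(text, alphabet):
    """Return every way to split `text` into a sequence of symbols from `alphabet`."""
    if text == "":
        return [[]]  # exactly one parse of the empty text: the empty sequence
    results = []
    for symbol in sorted(alphabet):
        # Skip empty symbols so the recursion always consumes input.
        if symbol and text.startswith(symbol):
            for rest in parses(text[len(symbol):], alphabet):
                results.append([symbol] + rest)
    return results


for parse in parses("000", {"00", "0"}):
    print(",".join(parse))
# 0,0,0
# 0,00
# 00,0

print(len(parses("00101111", {"0", "1"})))  # 1
```

The three results for "000" are the three distinct strings over {00,0} that share one concatenated spelling, whereas every text over the binary alphabet {0,1} has exactly one parse, so concatenated notation is unambiguous there.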