For a formal grammar G = { N, Σ, P, S }, where...
- N is a finite set of nonterminal symbols
- Σ is a finite set of terminal symbols, disjoint from N
- P is a finite set of production rules, each consisting of
- A finite string of symbols on the left (with at least one of them being nonterminal in N)
- A finite string of symbols on the right
- A special starting symbol S which is nonterminal in N
...the formal language L(G) is the set of all finite strings of symbols which can be produced by...
- Starting with a string consisting of just an S
- Applying a production rule to the current string, to turn it into a new string
- Repeating step 2 until there are no nonterminal symbols left in the string.
At this point, all the symbols in the string are terminal in Σ, which means no more production rules can be applied. So, the sentence itself is terminal.
While G is finite, L may be finite, countably infinite, or empty. L is distinct from the set of all finite strings of terminal symbols in Σ, because L only contains the producible strings. L is the set of all terminal strings.
The relationship between G and L
It is a relatively straightforward recursive process to derive all the the members of L from G.
- Stage 1: start with S
- Stage 2: take each production rule in turn. Try to apply it to S. Sometimes the rule is not applicable, but sometimes a new string is produced. Some of these are terminal. Add these to L. The rest are nonterminal. Hand these off to Stage 3.
- Stage 3: take each production rule in turn. Take each nonterminal string from Stage 2 in turn. Try to apply the rule to the string. This produces many, many more strings. Some of these are terminal. Add these to L. The rest are nonterminal. Hand these off to Stage 4...
- Continue...
This process may continue forever, as we know, because L may be infinite.
But what about reversing this process: turning a set of strings L back into a finite formal grammar G? This is much harder.
If L is finite, e.g....
L = {
fjjw
a
moocow
floipojwerpingelrnv
kkkkkkk
peas
...
zzzzz
}
...then in theory we can just enumerate all the members of L and have a grammar which produces each one in turn...
G = {
N = { S }
Σ = { a, b, ..., z } [every terminal character seen in L]
S = S, obviously
P = {
S → fjjw
S → a
S → moocow
S → floipojwerpingelrnv
S → kkkkkkk
S → peas
...
S → zzzzz
}
}
However, this is trivial, and not terribly helpful, because G is as huge as L and therefore doesn't tell us anything about L, which is the point of the exercise.
If we wish to make a small formal grammar from L, then the task becomes substantially more difficult. If L is infinite, then the task can even become impossible.
Proof that some languages have no formal grammar
If a language is a set of finite strings, then a language L is a subset of the set (call it V) of all possible finite strings. This means that the set of all languages is the set of all possible subsets of V. This is called the power set of V, which is written P(V). A power set is always strictly larger than the original set. V is countably infinite, meaning that P(V) is uncountably infinite. There are uncountably many languages.
However, every formal grammar G is finite. This means that the set of all possible formal grammars is countably infinite. Each formal grammar generates a single language L, which means, even including duplicates, that there only exists a countably infinite number of formal languages.
This means that uncountably many languages must not be formal languages; they cannot be generated by a finite formal grammar.