formal grammar

The Wikipedia article, interpreted for the layman

A string is a series of symbols. A symbol can be anything. One common example is an upper- or lower-case letter of the Roman alphabet.

"cat"
"misanthropy"
"sfdggkklsl"

If we allow other special characters we can also have:

"__^(\\\sdfsd_+_$%"
"dd'dddddd"
";kfllfddf7777lskl¬¬"

A much more intuitive example of a symbol is an English word. If this were the case, then

"spectacular"
"adventure"
"heroes"

would be symbols, and a string (or sentence) would simply be a series of words, such as

"bucket July embarrass whatsoever isn't a the"

A formal grammar is a set of rules for constructing a formal language of "grammatically correct" sentences. A sentence is in the language if it can be constructed according to the rules, and if it cannot, it is not. The rules we use are, as the name suggests, very terse, formal and mathematically precise.

Note that a grammar ascribes no formal meaning to the language it describes.

Formal definition

A formal grammar has four components.

First we have a finite set Σ of terminal symbols. The final sentence must consist only of terminal symbols. For our example we'll take:

Σ = { "a", "the", "dog", "cat", "barked", "napped"}
Next we have a finite set N of nonterminal symbols. Non-terminal symbols are special. I think of them as wildcards. If a sentence still contains non-terminal symbols then it is not finished yet. To help keep this intuitive, my non-terminal symbols will all have <angle brackets>.

N = { "<sentence>", "<noun>", "<verb>", "<article>" }

Note that there can be no symbols shared between Σ and N.
One of the non-terminal symbols, S, must be nominated as a start symbol from which all sentences must be built.

S = "<sentence>"
Finally, we have a finite set P of production rules, which is things we are allowed to do to construct sentences.

Every rule must contain at least one non-terminating symbol on the left.
P = {
1. "<sentence>" → "<article><noun><verb>"
2. "<article>" → "a"
3. "<article>" → "the"
4. "<noun>" → "dog"
5. "<noun>" → "cat"
6. "<verb>" → "barked"
7. "<verb>" → "napped"
}

A grammar G is thus expressed as

G = { N, Σ, P, S }

How do we apply the rules?

Obviously, we can theoretically concatenate any combination of symbols, to make examples like

"<verb> <verb> <verb> a dog a dog <sentence>"
"cat napped a"
"a a a a a a a a a a a <noun>"

However, starting from S and following the rules P, only some of these can be constructed.

A sentence y can be derived in one step from another sentence x if there is a rule which can be applied to connect the two. For example:

x = "cat cat cat <noun> <verb> the"

Using rule 4, "<noun>" → "dog", we could generate:

y = "cat cat cat dog <verb> the"

Or, using rule 7, "<verb>" → "napped", we could generate:

y = "cat cat cat <noun> napped the"

A sentential form is any sentence which can be derived in a finite number of steps, starting from S. Examples are:

"<sentence>" (0 steps)

"<article> <noun> <verb>" (1 step)

"<article> <noun> barked" (2 steps)

"a <noun> barked" (3 steps)

"a cat barked" (4 steps)

A finished sentential form is any sentential form which contains no non-terminating symbols. Only the last of the previous five examples counts. All the others have "wildcards" in.

Finally, the formal language L(G) is the set of all finished sentences:

L(G) = {

"a dog barked"
"a dog napped"
"a cat barked"
"a cat napped"
"the dog barked"
"the dog napped"
"the cat barked"
"the cat napped"

}

A more complicated example

These rules are a little more complex.

G = {

N = { S, B }

Σ = { a, b, c }

S = S, obviously

P = {

S → aBSc

S → abc

Ba → aB

Bb → bb

}

}

Here are some strings which are in this grammar. Check to see which rule is being followed for each step.

S
abc

S
aBSc
aBabcc
aaBbcc
aabbcc

S
aBSc
aBaBScc
aBaBabccc
aBaaBbccc
aBaabbccc
aaBabbccc
aaaBbbccc
aaabbbccc

The finished sentences are the last ones in each block. All the others contain Ss or Bs. So, notice how the language described by this example grammar is:

L(G) = {

abc
aabbcc
aaabbbccc
...

}

Note that while N, Σ and P are strictly finite (though they may be very large), L is often infinite.

So what?

The concept of a formal grammar is not a mathematical curio but of critical importance in the modern world.

Using the concept of a formal grammar to try to approach actual human languages like English and Chinese is possible, though extremely difficult and of dubious usefulness.

Formal grammars are used in mathematics to define what does or does not constitute a syntactically correct mathematical statement or equation. We can create a language of mathematics.

Bad:

488023a ++ 5353 ==== / 5 / 5 /
π * 2.3.34534532.!
993333339X999999 [

Good:

a² + b² = c² + 17.3
X ⇒ ( ( X ⇒ Y ) ⇒ Y )
2 + 2 = 87

Note again that a grammar ascribes no formal meaning to the language it describes. The fact that the statement is syntactically correct does not mean that it is factually correct. Grammatical correctness draws a line between gibberish and nonsense, but not between nonsense and sense.

Very simple formal grammars are used as the basis for the statements of propositional logic, which itself is the basis for much more valuable mathematics.

Formal grammars are also used to define syntactical correctness in machine languages: that is, programming languages (such as C, Java, Python, and almost any other language you can think of), markup languages (most notably HTML, which is used to describe web pages, and XML, which is skyrocketing in popularity for sending machine-generated information between points), and querying languages (SQL and its many brethren).

If you know HTML, a great example can be found by checking out the Document Type Definition of the web page that you're currently looking at: http://www.w3.org/TR/html4/loose.dtd. A DTD defines what constitutes valid HTML in the document to follow, and it explains which attributes each HTML element may have, and which elements may nest inside that element. Although the syntax is a little different from the one above, you can probably work out how to read the DTD and spot the wildcards. Notice how the syntax of the DTD is itself very strict and rigorous; it, too, is subject to the constraints of a formal grammar.

Parsing

In order to validate a sentence, i.e. tell whether it is contained within the formal language generated by its grammar, you have to reconstruct the original sequence of production steps. There are no shortcuts. You simply have to break it down into its components manually. This is almost always a necessary step in making sense of the sentence, in any case.

Unfortunately, for an arbitrary sentence and an arbitrary formal grammar, this process is very difficult, requires a fully-functional Turing machine and can be extremely computationally intensive. Because all of the machine languages mentioned above need to be valid, and they need to be broken down into their components in order to be useful, we have a conundrum.

The main way in which we can make this job easier is by putting restrictions on the types of rules and symbols we use in our formal grammar.

The most important of the various types of restricted grammar is a context-free grammar. In a context-free grammar, all of the production rules start with a single non-terminating character.

The rule

"<verb>" → "barked"

is non-contextual.

The rules

"<noun> <verb>" → "dog barked"

"<noun> <noun>" → "dog"

"dog <verb>" → "dog barked"

are contextual. They provide restrictions on when the sentence may be transformed - in the last example, we are dependent on the existence of the word "dog" immediately before the "<verb>", to provide context. If we had the sentence "cat <verb>" then this rule could not be used, because of the incorrect context.

In a context-free grammar, these rules would not be permitted. The position and context of the single non-terminating character would not affect the operations which could be performed upon it.

(Note how the first example grammar in this writeup is context-free as originally presented, while the second is not.)

Sentences in context-free languages can easily be broken down into their original components and production steps, and the automaton required need not be as complex as a full Turing machine - you can get away with a simpler machine called a pushdown automaton. Parsing sentences in context-free grammars is computationally simpler, more mathematically tractable and has in fact been the subject of a great deal of valuable study. Taking an existing contextual grammar and formulating it to become context-free (by changing its rules, without altering the resulting formal language) is a very useful endeavour.

Increasingly restricted grammars such as regular grammars require decreasingly complex automata to parse.

I should probably have mentioned Noam Chomsky who discovered all of this, but I couldn't think of anywhere good to put him.

Pushdown Automaton	formal language	Turing Machine	Noam Chomsky
Anglo	never judge a by a reading	Every poem about thunder was written under the protection of a roof, or, at the very least, an empty sky	operator-precedence parsing
regular grammar	Document Type Definition	Colorless green ideas sleep furiously	context-free grammar
Roman Alphabet	Chomsky hierarchy	String	HTML
From Darkness, Lead Me To Light	Stephen Pinker	Letters to the Editors: January 2010	The Imaginarium of Dr. Parnassus
orphan negative	The Machine in the Ghost	big-O notation

Formal definition

How do we apply the rules?

A more complicated example

So what?

Parsing

Page category: