Let L be a regular language. That is, L is a set of words which is recognised by a finite state automaton (or, equivalently, by some regular expression). Consider a DFA recognising L, and suppose L is infinite. Suppose a DFA D recognising L has k states. We may write the steps D takes when recognising a word w=w1w2...wn above the letters:

```s1s2...sn
w1w2...wn
```
This just means that after reading w1...wk, D is in state sk.

Then for a word w in L of length >k, due to the pigeonhole principle, some state s must repeat itself: s=si=sj, i < j. Then D must also accept

```w1w2...wiwj+1...wn
w1w2...wi...wjwi+1...wj...wn
```
and so on. In fact, if we write x=w1...wi, y=wi+1...wj, z=wj+1...wn (x and y may both be empty), then w=xyz, and for all nonnegative integers m, the word xymz is also in L.

We've just proved

[The pumping] lemma: Let L be an infinite regular language. Then there exists some k such that if w∈L has more than k letters, then w=xyz where
For all j≥0, xykz∈L.

The pumping lemma is useful for proving certain languages are not regular. This approach is usually taken in introductory computation courses (see e..g pumping lemma proof that the balanced braces language is not regular). However, in practice the Myhill / Nerode theorem is much more useful for proving that languages are not regular.

Pumping lemmas also exist for CFGs (context free grammars); this is noded below.

About the Pumping Lemma Here I try to explain the Pumping Lemma in plain English. Still some knowledge of Computational Theory is assumed.

To prove that a Language is Regular, one can construct a DFA, an NFA or a Regular Expression for that Language. If any of these can accept or express the exact set of strings in the Language, then it can be said that the Language is Regular.

Such proofs are called proof by construction.

To prove that a language is not Regular is a different matter. The fact that a DFA, NFA or a Regular Expression has yet to be found to accept or express all strings in the language is not sufficient evidence. It is always possible that such a device still exists but has yet to be found! In order to prove that a language is not Regular, we will use a different mechanism. We will consider a property that is true of all Regular Languages. This property is known as the Pumping Lemma. We will then assume that a language which we suspect is not Regular to be Regular, and then demonstrate that the properties defined by the Pumping Lemma do not hold. In other words, we will utilize proof by contradiction.

The idea of the Pumping Lemma is as follows:

Consider a DFA that consists of three states.
We know that when we run the Machine on a string, a transition is made for each character in the string. Each transition leaves us in a predetermined state. This means that for every character in a string, we have visited a (not necessarily unique) state. If our DFA has three states, and we run a string of length five on it, (ignoring the start state) five states have been reached. If our DFA consists of only three states, the pigeon-hole principle states that at least on state must have been visited multiple times. If this is true, we know that there is at least one cycle in our DFA. By cycle, we mean, a sequence of states that start and end with the same state. If we look at the DFA, we can find a sequence of characters that will bring us through this cycle. Since we return to the same state each time we go through the cycle, it can be seen that if a cycle can be traversed once in a DFA, it can be traversed infinitely many times. In other words; If a DFA has a cycle that, when traversed, keeps open the possibility of exiting and reaching an accept state, any string that has more characters than the number of states in the DFA will have a substring which is infinitely repeatable. When this substring is repeated (pumped), the resulting string will also be in the language of the DFA. Since this is true of all DFA's, it is true of Regular Languages.

To prove that a language is not Regular, we show that the Pumping Lemma does not hold. To do this, find a string which is known to be larger than the Pumping L ength. Show that there is no substring that can be infinitely repeated while keeping the resulting string in the language. If you can not show this, try another string. There may be some strings in non-regular languages that are perfectly pumpable! Failure to find such a string proves nothing about the language being Regular or Non-Regular. The Pumping Lemma can not be used to prove that a language is Regular.

The Formal definition of the Pumping Lemma can be found on page 78 of Michael Sipser's Introduction to the Thoery of Computation.

It says that if a language is Regular, a Pumping Length can be found. Remember, intuitively, we consider this number to be the number of states in our Machine.
If a string is longer than this length, it is possible to subdivide it into three parts:
x, y and z

x consists of all characters parsed before the DFA cycle.
y consists of all characters parsed in the DFA cycle.
z consists of all characters parsed after the DFA cycle.

There are three properties that this string, under the subdivision will have. Here I explain their signifigance.
• for all i >= 0, x(y^i)z will be a string that is in t he language.
This makes since because the cycle can be traversed any number of times.
(By y^i, we mean y*)
• |y| > 0
If the length of y were 0, this would mean that there is nothing to pump ! Essentially, there is no part of the string that has gone through the cycle.
• |xy| < = p
If the length of xy were greater than p, we have made a bad choice for y.
xy is longer than the number of states, y is our cycle, bu t before we have finished traversing it, we have already seen some state more than once. There must be a cycle somewhere earlier in the string. That earlier cycle should be our y!

When proving that the language is not regular, it is not sufficient to choose one arbitrary subdivision of the string into x,y,z and show that y can not be pumped. It must be shown that for all possible subdivisions of the string into x,y,z, pumping y will always, at some point, result in a string which is not in the language.

The Pumping Lemma for Context-Free Languages

If a language is a context-free language (CFL), then there exists a number called the pumping length such that any string in the language which has length equal to or greater than the pumping length can be divided into five pieces which satisfy the following conditions:

First, let's define some variables...
A is the language
p is the pumping length
s is a string in the language A made up of 5 substrings such that s = uvxyz

1. For all i ≥ 0, uvixyiz is in the language A
2. |vy| > 0
3. |vxy| ≤ p
Or for those of us who prefer plain English:
1. The v and y parts of s can be removed or repeated any number of times. As long as they are both removed or both repeated the same number of times, the resulting string will be in the language A. For example uxz, uvvxyyz, and uvv...vxyy...yz are all in A.
2. v and y can not both be the empty string (ε); however, one of them can be.
3. If you add the lengths of v, x and y, the sum can not be greater than p.

The first condition is really the heart of the lemma. Without the second condition, the lemma would be true for all languages, therefore useless. The third condition is not necessary, but it makes using the pumping lemma easier.

The Proof

We must show that for a CFL A, any string string s which is long enough can be pumped, and the resulting string will also be in A.1

Assume that s is a very long string (we'll be more specific about what "very long" is later). If G is a context-free grammar (CFG) which generates A, then s is derivable from G, and therefore has a parse tree.2 Since the right-hand side (RHS) of rules in a CFG are of finite length, strings generated by G which are beyond a certain length require increasingly taller parse trees. More simply, very long strings lead to very tall parse trees.

Let V be the set of variables for G. If the height of the parse tree for s is greater than |V| + 1, then by the pigeonhole principle, we know that in the longest branch of the parse tree, at least one of the variables is repeated. This is intuitive, since all but the last node in a parse tree are variables, then the longest branch has more than |V| variables. If we have only |V| different ones, then at least one of them is repeated.

So, now that we know there is at least one repetition of a variable in the longest branch of the parse tree. Call this variable R. The portion of the parse tree between the first and second appearance of R can be removed (pumping down) or repeated (pumping up) any number of times. This will result in different strings without invalidating the parse tree. This is easier to visualize with the following figure:

S is the start varaible

The original tree: (The colons represent an unknown number of intermediate steps in the branch)

```
S
/:\
/ : \
/  :  \
/   R   \
/   /:\   \
/   / : \   \
/   /  :  \   \
/   /   R   \   \
/   /   / \   \   \
/___/___/___\___\___\
u   v   x   y   z```

The pumped down tree:

```
S
/:\
/ : \
/  :  \
/   R   \
/   / \   \
/   /___\   \
/   /  x  \   \
/   /       \   \
/   /         \   \
/___/           \___\
u               z```

The pumped up tree:

```
S
/:\
/ : \
/  :  \
/   R   \
/   /:\   \
/   / : \   \
/   /  :  \   \
/   /   R   \   \
/   /   /:\   \   \
/___/___/ : \___\___\
u   v/  :  \y   z
/   R   \
/   / \   \
/___/___\___\
v   x   y```

Now that it is reasonably obvious that the concept is sound, we show how to calculate the pumping length and prove the three conditions.

Calculating the Pumping Length

Given G, call b the maximum number of symbols on the RHS of a rule in G. We are only concerned with grammars with b ≥ 2.3 Since a node in the parse tree can have at most b children, a parse tree of height h can have at most bh leaf nodes. Since all leaf nodes are terminals, this is the maximum length of the string generated by a parse tree of height h.

Recall that |V| is the number of variables in G. If we set p to be b|V|+2, then we know the height of the tree is at least |V| + 2, therefore at least one variable repeats, and the string can be pumped.

Condition 1

Given that s is a string in A of length ≥ p, we will show how to pump s. Let τ be a parse tree for s. Since |s| ≥ p, we have already shown that τ has height ≥ |V| + 2, and therefore at least one variable is repeated in the longest branch of τ. Call this repeated variable R. Furthermore, choose R to be a variable that is repeated among the last |V| + 1 variables on the branch.

Divide s into uvxyz according to the first figure. The two instances of R each have a subtree which generates part of s. The upper instance generates vxy, and the lower instance generates just x. Since both subtrees are generated by the same variable, we can substitute one for the other, and still have a valid parse tree. Replacing the upper subtree with the lower subtree (as in the second figure) generates the string uxz. Replacing the lower subtree with the upper subtree (as in the third figure) generates uvvxyyz. This can be done repeatedly to generate all strings of the form uvixyiz for i > 1. We know that all these strings are in the language A because they are generated by G.

Condition 2

If both v and y were the empty string, then the lemma would hold true for any p and any string. This is true since the strings ab, aεb, and aεε...εb are all equivalent. Any number of empty strings can be inserted anywhere in any string without changing it. We include this condition because the pumping lemma is useful for proving that languages are not context-free. To do that one must show that the pumping lemma fails.

Condition 3

We must show that |vxy| ≤ p. In the parse tree for s (the first figure), the upper instance of R generates vxy. Since we chose R to be a variable which is repeated among the last |V| + 1 variables on the path, we know that the subtree which generates vxy has height ≤ |V| + 2. As we previously proved, the length of the longest string that could be generated by a tree of this height is b|V|+2, which is the value we chose for p.

1. "Pumping" refers to changing the value of i in the expression uvixyiz. Pumping up means making i greater than 1. Pumping down means making i 0.
2. By the definition of CFLs, such a grammar is guaranteed to exist.
3. Grammars with b < 2 can not generate strings of more than 1 symbol.