The { 1p, p prime } set is composed of all the prime numbers written in base 1. For example, in base 1, the number 5 is written 11111, and the { 1p, p prime } set contains the elements 11, 111, 11111, 1111111, etc.

The question raised here is to determine if this language is a regular language. If the answer is positive, then it would mean that it would be possible to do primality testing with regular expressions (i.e. that prime numbers can be accepted by a deterministic finite automaton DFA). But on the contrary if the answer is negative, can algorithms similar to regular expressions be designed to do primality testing and what changes are necessary to reach this goal?

Unfortunately, the language is NOT a regular language so it is not possible to do primality testing with regular expressions. Still, it has been shown that it is possible to do primality testing with Perl regexs. Read on to learn how this is possible...

The fact that { 1p, p prime } is not a regular language can be proven thanks to the Pumping Lemma (also called Pigeon Hole Lemma because of its proof). In short, the lemma shows that sufficiently long strings s of a regular language (L) are composed of a short prefix uv and the remaining characters w (s=uvw) such that the words uviw are in L, whatever positive integer value is given to i. The 'short' adjective means that its length is less or equal than the number of states (which is finite) of the automaton.

Here comes the proof

The goal is to prove that the negation of the Pumping Lemma is true : there exists arbitrarily long strings for which every decomposition s=uvw implies the existence of i such that uviw is not in the language.

The first thing needed is the set the theorem will be applied on :

Let L = { 1p, p prime }
Note that for every element k of L, length(k) is prime. It is therefore equivalent to talk of k as a prime number or as a string which length is this prime number.
Say an automaton A with k states recognizes L. Let p an element of L, length(p) > k.
For every decomposition p=uvw, there exists i > 0 such that : length(uviw) = length(u) + i*length(v) + length(w) is not prime, and therefore uviw is not in L. Hence L is not a regular language.
For example, with i=length(u)+length(w), length(uviw) = (length(u) + length(w))*(length(v)+1). length(uviw) isn't a prime number because it is composed of two factors greater than 1.

What went wrong ?

Roughly speaking, this means that the base 1 representation of prime numbers isn't composed of a finite set of patterns, linked together by means of union and the Kleene star (words are of the form ab*c : ac, abc, abbc, abbbc, abbbbc...). Deterministic Finite automata are ineffective at recognizing this language because "regular expressions can't count". Application of the lemma shows that an infinite number of states is required to accept this language. Furthermore, it shows that no infinite subset of L is a regular language because none of the prime numbers has a uvw decomposition that satisfies the lemma.

It is nevertheless possible to do primality testing with enhanced regular expressions such as Perl regexs. Perl regexs can create a capture buffer of a previously matched substring and use it for recognition of the language. The idea behind primality testing with Perl regexs is that if a substring of length p ≥ 2 is repeated q ≥ 2 times, the string is of length pq and therefore is not prime.

Finite automata lack memory. A state machine only knows at which state it is. This is the philosophy behind state machines : they have a finite number of states and no memory. On the contrary, a Turing machine can alter the infinite tape it is reading and therefore simulate the effects of having an infinite number of states. The Perl capture buffer is one way of achieving this (provided the memory of the computer is sufficient).

What is it then ?

{ "1^p" p prime } isn't a regular language, still it is a language. Chomsky defines four types of language, known as the Chomsky hierarchy : regular, context free, context-sensitive and unrestricted, each of them having their specific automaton type (DFA, push-down automaton, linear-bounded automaton, Turing machine). m_turner pointed out that this language must be a context-sensitive language.

The Pumping Lemma does not prove that the prime number language is not a regular language, but it does prove that in base 1 it is not true. Unfortunately for people interested in cryptography, it isn't true for most representations (for example the prime in binary language isn't regular either).