Prime number language

The set of prime numbers is certainly not a regular language. Unfortunately this has not been proven yet and will probably never be because of the countless ways existing to represent a number. There is indeed a great difference between the concept of number and the characters used to represent them (22 decimal = 16 hexadecimal = 10110 binary = twenty two = vingt-deux = ...). Not to mention that they can be represented as a product of primes, which makes primality testing very very easy, except that writing them in that form isn't that easy.

Using finite state machines to deal with prime numbers sounds interesting. Determining the acceptance of a string by a deterministic finite automaton (DFA) is a really fast process. Besides that, this process has the great property that it can be inverted : every path linking an entry point to an exit point of the state machine gives an accepted string, thus the same state machine could be used for prime number generation and primality testing, with the great advantage that the process is in O(n) for both ways.

There aren't many papers out there which explain what doesn't work. Yet, understanding why things are intrinsically impossible is as important as understanding why others work. As you'll see, finding out why finite state machines are inefficient to deal with prime numbers points out interesting results on the set of prime numbers and deterministic finite automata.

Prime numbers and base 1

A prime number p written in base 1 consists of p times the digit 1, and therefore its length is the prime number p. But unfortunately this language, as shown in "1^p" p prime language is not a regular language. The proof points out that a state machine with an infinite number of states is required to match this language.

Nevertheless, it takes only a few modifications to have a state-machine-like algorithm for the language. One idea is to introduce a capture buffer : as the token advances in the state machine, a buffer is filled with characters. The buffer can then be tested against the remaining characters. The idea behind this is to fill the capture buffer with at least two characters and test if this sequence is repeated at least two times to represent the number. It is therefore possible to do primality testing with Perl regexs. But these are not state machines. When acceptance of the string fails, it means that the choice of the buffer was bad. The algorithm has to start over with a new buffer until there are no more cases left. This method is in O(n²) and is only of theoretical interest.

The important thing shown here is that state machines do not know how to count. This means that they can only count up to a finite given value, but not an arbitrarily big one. For example there exists state machines that will recognize odd numbers (eg. "1(11)*"), even numbers (eg. "(11)*") or numbers which are equal to three modulus 5 (eg. "(11111)*111"), because in those cases they only need to count up to 1, 2, 3 and 5. For prime numbers, they would have to count up to that prime number (or its square root, etc...) and it can be arbitrarily big. One can only build a state machine that tests divisibility by a finite set of numbers, for example "(11)*|(111)*|(11111)*|(1111111)*" will accept any integer divisible by 2, 3, 5 or 7.

Prime numbers and base 2

State machines are meant to recognize patterns. In base 2 (aka. binary), prime numbers are represented by a succession of '0's and '1's and therefore this base might be more adapted to primality testing since it creates "patterns". Say I write a dot (.) for each 0 of the binary representation of the prime, and a X for each 1. If I write that way the prime numbers from 3 to 100, I get an interesting diagram which is reproduced below. Two things are important to notice about it : can repeating patterns be found and how does its width expand.

XX
X.X
XXX
XX.X
X.XX
X...X
XX..X
XXX.X
X.XXX
XXXXX
X.X..X
X..X.X
XX.X.X
XXXX.X
X.X.XX
XX.XXX
X.XXXX
XX....X
XXX...X
X..X..X
XXXX..X
XX..X.X
X..XX.X
X....XX
X.X..XX
XXX..XX
XX.X.XX
X.XX.XX
X...XXX
XXXXXXX
XX.....X
X..X...X
XX.X...X
X.X.X..X
XXX.X..X
X.XXX..X
XX...X.X
XXX..X.X
X.XX.X.X
XX..XX.X
X.X.XX.X
XXXXXX.X
X.....XX
X.X...XX
XXX...XX
XX..X.XX
XXXXX.XX
XX...XXX
X.X..XXX
X..X.XXX
XXXX.XXX
X...XXXX
XX.XXXXX
X.......X
XXX.....X
X.XX....X
XXXX....X
X.X.X...X
X..XX...X
XX.XX...X
X.X..X..X
XX..XX..X
XXX.XX..X
X..XXX..X
X.XXXX..X
XX.X..X.X
X...X.X.X
XX.XX.X.X
X.XXX.X.X
X....XX.X
XXX..XX.X
XXXX.XX.X
X.X.XXX.X
XXXXXXX.X
X.X....XX
X.XX...XX
X...X..XX
X..XX..XX
XX...X.XX
X.X..X.XX
XXXX.X.XX
X...XX.XX
XXX.XX.XX
XX.XXX.XX
X.....XXX
X..X..XXX
X.XX..XXX
XXXX..XXX
XX..X.XXX
XXXXX.XXX
XXX..XXXX
XX.X.XXXX
XX..XXXXX
XXX.XXXXX
X.XXXXXXX
X..X.....X
XX.X.....X

The characters . (dot) and X have been chosen because they contrast better than 0 and 1. The digits are written backwards : 19 should be written X..XX instead of XX..X

The diagram will widen indefinitely because the set of prime numbers is infinite. Gauss stated that Π(n), the number of primes ≤ n, is equivalent to n/ln(n). Note that Π^-1(k) gives the k^th prime number. Gauss's equivalent is very good, indeed 0.9*n/ln(n) < Π(n) < 1.1*n/ln(n) for not so large values of n (n ≥ 100). There are Π(2^k+1) - Π(2^k) prime numbers between 2^k and 2^k+1, ie. that have k binary digits. That represents roughly 2^k/ln(2)*(2/(k+1)-1/k) primes, which is equivalent to 2^k/(k*ln(2)) for large values of k.

But this is not likely to be the number of different words with a given length of a regular language. Say φ(y) is the number of words of a regular language, formed of y letters. It can be proven that φ(y) is a recurrent linear sequence (see regular language), and thus the (real) solutions are linear combinations of terms of the form P(y)*exp(λ*y), where P is a polynom and λ a real number. It will never be equivalent to 2^k/(k*ln(2)). This yields an interesting result : the growth of a regular language is either equivalent to a linear function of positive or null slope, or equivalent to a function of the form μ*y^kexp(λ*y), k integer ≥ 0 and λ, μ real number > 0. The growth of the set of prime numbers is not compatible with that of a regular language, therefore it isn't a regular language.

Prime numbers and base n

It would be quite complex to prove that the set formed of the base n representations of prime numbers isn't a regular language using the pumping lemma as is is done in "1^p" p prime language and in "p in binary" p prime language because it involves heavy arithmetic. But it is reasonable to think that is can be done.

However, the estimation of words of length k remains valid for other bases; in base n it is equivalent to n^k/(k*ln(n)), which is not the growth of a regular language. Prime numbers written in any base never form a regular language.

Subsets of prime numbers and prime number generation

Suppose I wish to generate prime numbers. I don't need a function that is able to generate every prime number, but an infinite subset of primes will do. Hence the question : is it possible to build a deterministic finite automaton that will generate an infinite set of primes ? The answer is unfortunately no.

The idea behind this method would be to take a bunch of prime numbers and to mix them up in order to generate a new prime number. Since the set of base n representations of prime numbers isn't a regular language, thanks to the pumping lemma it is possible to show that none of the prime numbers can be decomposed in the form uvw such that uvⁱw is a prime number for every given integer i. It is proven for bases 1 and 2 and it is reasonable to think that it is true for other bases.

A method is to use an algorithm that gives numbers which are mostly primes and use a probabilistic test to check out the bad ones. For example it is known that many numbers of the form 2^2ⁿ-1 and 2^2ⁿ+1 (2^(2^n) +/- 1) are prime numbers.

Closing words

Deterministic finite automata aren't suited to deal with prime numbers. Even though methods can be derived from them, they lack performance results and are only of theoretical interest. If you want to play with prime numbers, forget state machines, but remember that you can learn from things that don't work.

This probably means that our usual "base n" representation of numbers isn't well suited to deal with prime numbers. Nevertheless it show for sure that there is more to numbers than their representation. Writing a number in base n is only performing a simple bijection between the set of numbers and the set of their base n representation. Maybe great improvements will be achieved in prime number theory when a better representation technique is found for numbers.

Primality testing and prime number generation are topical subjects because it's the heart of cryptography, the only way to transmit secured documents today. The most efficient way to accomplish these two tasks nowadays is to use probabilistic methods. State machines are useful to find logic in apparently random languages, still they are very limited.

Why casinos have an upper limit on bets	I can't decide if what you're saying is too profound for me to comprehend or just insane	Proof that all odd numbers are prime	"1^p" p prime language
Palindromic numbers 101	regular language	Primality testing with Perl regexs	Prime number distribution
Perrin numbers	Primality Testing	one-dimensional cellular automaton	prime number theorem
Peano postulates	Signed English	Turing test	Earl of Aberdeen
Peril on the Bishop Ford	deterministic finite automaton	tie rod	pumping lemma
Priming	Zero	base 1

Recommended Reading

About Everything2

User Picks

Editor Picks

New Writeups