Introduction

A lexical analyzer makes up the first stage of processing that a compiler performs on input source code. The characters making up the source code are read one at a time and translated into tokens. For example, if the lexical analyzer reads the character '=' it can pass the corresponding token on to the next stage of processing, the syntax analyzer (tokens are often represented as ints). As well as single characters such as punctuation and mathematical symbols, the lexical analyzer must recognise longer strings of characters which make up keywords, identifiers and constants. The lexical analyzer should not tokenize white space (assuming it is unimportant to the meaning of the language) or comments. At this stage in the compilation process the task is easy: the lexical analyzer only has to recognise that the code input to it contains legal elements that can be tokenized; it does not have to check that they appear in a legal order - that task is left to the syntax analyzer.
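
As a minimal sketch of tokens represented as ints (the token names and the tiny language they describe are invented for this example), the token set might be declared in C as:

    enum token {
        TOK_ASSIGN,      /* '='                   */
        TOK_PLUS,        /* '+'                   */
        TOK_SEMICOLON,   /* ';'                   */
        TOK_INT,         /* the keyword "int"     */
        TOK_IDENTIFIER,  /* names such as "count" */
        TOK_NUMBER,      /* integer constants     */
        TOK_EOF          /* end of the input      */
    };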

Writing Lexical Analyzers

It is possible to write a lexical analyzer by hand or by using a lexical analyzer construction tool such as lex or Flex. To write a lexical analyzer by hand, there are two different methods: the ad hoc method and the formal method.

Ad hoc method

The ad hoc way of writing a lexical analyzer is based on the following basic approach. Read characters into a string (or char array or similar) one at a time. If the character you just read forms a token on its own in your source language, stop reading after that one char, work out exactly which token it is, then send the token to the syntax analyzer. If the character you read is part of a larger token, e.g. it starts with the letter 'i', then carry on reading characters into your string until a character of a different type is reached, such as white space or a piece of punctuation. Again, once the whole string of characters has been read it can be identified. It is important to remember that the keyword 'int', for example, cannot be matched until the character after it has been read - it may in fact turn out to be the identifier 'int1'.

When an identifier or a constant is found, as well as sending the token to the syntax analyzer, its value must also be remembered, since a program will usually contain more than one distinct identifier, for example. Usually this is done by adding the values to the symbol table.
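
A rough sketch of this approach in C might look like the following. It is only an outline under the assumptions above ('=' as a single-character token, lowercase identifiers, the keyword 'int'), it reuses the token values from the enum sketched in the introduction, and the add_to_symbol_table helper is invented for the example:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    void add_to_symbol_table(const char *lexeme);   /* hypothetical helper */

    int get_token(FILE *in)
    {
        int c = fgetc(in);

        while (c != EOF && isspace(c))    /* white space is not tokenized   */
            c = fgetc(in);
        if (c == EOF)
            return TOK_EOF;

        if (c == '=')                     /* a single-character token       */
            return TOK_ASSIGN;

        if (islower(c)) {                 /* start of a keyword/identifier  */
            char lexeme[64];
            int len = 0;
            while (c != EOF && (islower(c) || isdigit(c)) && len < 63) {
                lexeme[len++] = (char)c;
                c = fgetc(in);
            }
            lexeme[len] = '\0';
            ungetc(c, in);                /* push back the character read
                                             past the end of the word       */
            if (strcmp(lexeme, "int") == 0)   /* only now can 'int' be told */
                return TOK_INT;               /* apart from e.g. 'int1'     */
            add_to_symbol_table(lexeme);      /* remember the identifier    */
            return TOK_IDENTIFIER;
        }

        return TOK_EOF;                   /* anything else is an error; see
                                             "Errors in input" below        */
    }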

Formal method

The formal method automates the process of writing a lexical analyzer; it is also similar to how lexical analyzer construction tools work. Regular expressions describing the tokens are written (these would be given as input to a lexical analyzer construction tool). The regular expressions are then translated into a finite state machine (FSM) whose edges are labelled with the characters that cause transitions.
If the following token matching expressions are defined:
<IDENTIFIER> <- letter(letter|digit)* (the token <IDENTIFIER> is matched when a letter followed by any number of letters or digits is found)
letter <- ["a"-"z"] (a letter is defined as a lowercase letter from a to z)
digit <- ["0"-"9"] (a digit is defined as a decimal digit from 0 to 9)
These definitions can be translated into an FSM thus:

                                                   _____
                    __                            | ___ |
                   |  |----E--------------------->||END||
   _____           |  |           __              ||   ||
->|START|--letter->|  |--letter->|  |-------E---->||   ||
  |_____|          |  |<----E--- |__|             ||   ||
                   |  |           __              ||___||
                   |  |--digit-->|  |-------E---->|_____|
                   |__|<----E--- |__|

(E denotes empty moves - usually denoted with the Greek letter Epsilon - which is not recognised by all popular browsers)

This is further refined into a deterministic finite automaton (DFA); DFAs have no Epsilon moves and no two moves with the same label leaving the same state. Using the same example as above, this can be achieved thus:

                     _____
                    | ___ |----letter--*
   _____            ||END||            |
->|START|--letter-->||   ||<---------- *
  |_____|           ||   ||----digit---*
                    ||___||            |
                    |_____|<-----------*

Code can be written systematically from a DFA, for example by writing a switch statement with a case for each state of the machine.
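
As a rough sketch of that idea (the state and function names are invented for this example), the identifier DFA above might be coded in C as a loop around a switch on the current state:

    #include <ctype.h>

    enum state { START, END };            /* the two states of the DFA above */

    /* Returns 1 if s is matched as an <IDENTIFIER>, 0 otherwise. */
    int match_identifier(const char *s)
    {
        enum state st = START;

        for (; *s != '\0'; s++) {
            switch (st) {
            case START:                   /* only a letter may begin it      */
                if (islower((unsigned char)*s))
                    st = END;
                else
                    return 0;
                break;
            case END:                     /* letters or digits may follow    */
                if (islower((unsigned char)*s) || isdigit((unsigned char)*s))
                    st = END;
                else
                    return 0;
                break;
            }
        }
        return st == END;                 /* accept only if END was reached  */
    }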

Issues in writing a lexical analyzer

  • Big numbers:

    When recognising large numbers in the input, there is a chance that a number could be larger than the computer can store in any numeric type. The number cannot simply be read in and then tested, because by the time it could be tested it may already have overflowed. For this reason, numbers should be kept as strings by the lexical analyzer; their size can then be checked later on in the compilation process (a rough sketch of this appears after this list).
     
  • Errors in input:

    If an illegal character or collection of characters is found, the lexical analyzer should report an error to the user. It should also try to continue checking the rest of the program for further errors, which means that it must find a way to re-synchronize itself with the input code. One way of doing this is to skip input until a character which legally starts a token is found, as in the sketch after this list.
     
  • Languages where keywords are not reserved:

    If keywords such as else are not reserved in the programming language, the lexical analyzer has no way of telling whether a string "else" is a keyword or an identifier. This means that the program must check the context in which the string appears, which is normally the task of the syntax analyzer - where this new checking comes into the program is up to the individual programmer (either way it is going to be messy - this is also an issue for defining programming languages and shows why those who know how to write compilers should be consulted).
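
To illustrate the first two points, here is a rough C sketch (the function names are invented, and TOK_NUMBER comes from the enum sketched in the introduction). It keeps a number's digits in a string rather than converting them, and it re-synchronizes after an illegal character by skipping input until something that can legally start a token is found:

    #include <ctype.h>
    #include <stdio.h>

    /* Read a number, keeping its digits as a string so that a value too big
       for any numeric type cannot overflow here; its size can be checked
       later in compilation. */
    int get_number(FILE *in, char *lexeme, int max)
    {
        int c = fgetc(in), len = 0;
        while (c != EOF && isdigit(c) && len < max - 1) {
            lexeme[len++] = (char)c;
            c = fgetc(in);
        }
        lexeme[len] = '\0';
        ungetc(c, in);
        return TOK_NUMBER;
    }

    /* Report an illegal character, then skip input until a character which
       legally starts a token is found, so that scanning can continue. */
    void recover(FILE *in, int bad)
    {
        int c;
        fprintf(stderr, "lexical error: illegal character '%c'\n", bad);
        while ((c = fgetc(in)) != EOF) {
            if (islower(c) || isdigit(c) || c == '=') {   /* legal starts */
                ungetc(c, in);
                return;
            }
        }
    }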
