Introduction
A lexical analyzer makes up the first stage of processing that a compiler
performs on input source code. The characters making up the source code are
read one at a time and translated into tokens. For example, if the lexical
analyzer reads the character '=' it can pass the corresponding token on to
the next stage of processing, the syntax analyzer (tokens are often
represented as ints). As well as simple characters such as punctuation and
mathematical symbols, the lexical analyzer must recognise longer strings of
characters which make up keywords, identifiers and constants. The lexical
analyzer should not tokenize white space (assuming it is unimportant to the
meaning of the language) or comments. At this stage in the compilation
process the task is easy: the lexical analyzer only has to recognise that
the code input to it contains legal elements that can be tokenized; it does
not have to check that they are in a legal order - that task is left to the
syntax analyzer.
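As a small illustration of the point that tokens are often just ints, a
hand-written lexer might declare token codes along these lines (the names
and values here are invented for this sketch, not part of any standard):

    // Hypothetical token codes - a real compiler defines its own set.
    public class Token {
        public static final int ASSIGN     = 1;   // returned when '=' is read
        public static final int SEMICOLON  = 2;   // ';'
        public static final int IDENTIFIER = 3;   // e.g. "count1"
        public static final int NUMBER     = 4;   // e.g. "42"
        public static final int KW_INT     = 5;   // the keyword "int"
    }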
Writing Lexical Analyzers
It is possible to write a lexical analyzer by hand or to generate one using
a lexical analyzer construction tool such as lex or Flex. To write a lexical
analyzer by hand, there are two different methods: the ad hoc method and the
formal method.
Ad hoc method
The ad hoc way of writing a lexical analyzer is based on the following basic
approach. Start reading characters into a string (or char array or similar)
one at a time. If the character just read forms a complete token on its own
in the source language, stop reading after that one character, work out
exactly which token it is, and send the token to the syntax analyzer. If the
character is the start of a larger token, e.g. the letter 'i', carry on
reading characters into the string until a character of a different type is
reached, such as white space or a piece of punctuation. Again, once the
whole string of characters has been read it can be identified. It is
important to remember that the keyword 'int', for example, cannot be matched
until the character after it has been read - it may in fact turn out to be
the identifier 'int1'.
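A minimal sketch of this loop is given below (in Java). The class name and
the exact treatment of single-character tokens are assumptions made for the
example; a real scanner would handle many more cases, such as operators made
of two characters.

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.Reader;

    public class AdHocScanner {
        private final PushbackReader in;

        public AdHocScanner(Reader r) { in = new PushbackReader(r); }

        // Returns the next lexeme as a string, or null at end of input.
        public String nextLexeme() throws IOException {
            int c = in.read();
            while (c != -1 && Character.isWhitespace((char) c)) {
                c = in.read();                    // white space is not tokenized
            }
            if (c == -1) return null;             // end of input

            if (!Character.isLetterOrDigit((char) c)) {
                return String.valueOf((char) c);  // single-character token, e.g. '=' or ';'
            }

            // Part of a larger token: keep reading until the character type changes.
            StringBuilder lexeme = new StringBuilder();
            while (c != -1 && Character.isLetterOrDigit((char) c)) {
                lexeme.append((char) c);
                c = in.read();
            }
            if (c != -1) in.unread(c);            // put back the character that ended the token
            return lexeme.toString();             // e.g. "int" or "int1" - classified afterwards
        }
    }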
When an identifier or a constant is found, as well as sending the token to
the syntax analyzer, its value must also be remembered, since the token on
its own does not say which of the program's many identifiers was actually
seen. Usually this is done by adding the value to the symbol table.
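A symbol table can be as simple as a map from the identifier's text to an
entry that later phases fill in. The sketch below assumes nothing about the
rest of the compiler; the class and field names are illustrative only.

    import java.util.HashMap;
    import java.util.Map;

    public class SymbolTable {
        public static class Entry {
            public final String name;     // later phases would add type, scope, etc.
            public Entry(String name) { this.name = name; }
        }

        private final Map<String, Entry> entries = new HashMap<>();

        // Returns the existing entry for this identifier, creating one if needed.
        public Entry lookupOrAdd(String name) {
            return entries.computeIfAbsent(name, Entry::new);
        }
    }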
Formal method
The formal method automates the process of writing a lexical analyzer; it is
also how lexical analyzer construction tools work. Regular expressions
describing the tokens are written (these would be given as the input to a
lexical analyzer construction tool). The regular expressions are then
translated into a finite state machine (FSM) whose edges are labelled with
the characters that cause transitions.
If the following token matching expressions are defined:

    <IDENTIFIER> <- letter(letter|digit)*
        (the token <IDENTIFIER> is matched if a letter followed by any
         number of letters or digits is found)
    letter <- ["a"-"z"]    (a letter is a lowercase letter from a to z)
    digit  <- ["0"-"9"]    (a digit is a numeral from 0 to 9)
this can be translated into an FSM thus (E denotes an empty move, usually
written with the Greek letter epsilon):

    START --letter--> S1
    S1 --E--> END                        (a single letter is an identifier)
    S1 --letter--> S2,  S2 --E--> S1     (loop back for further letters)
    S1 --digit--> S3,   S3 --E--> S1     (loop back for further digits)

END is the accepting state.
This is further refined into a deterministic finite automaton (DFA); a DFA
has no epsilon moves and no two moves with the same label leaving the same
state. Using the same example as above, the result is:

    START --letter--> END
    END --letter--> END
    END --digit--> END

Again END is the accepting state: any input beginning with a letter and
continuing with letters or digits is matched as an identifier.
Code can be written systematically from a DFA, for example as a switch
statement with a case for each state, as sketched below.
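The following is one way such code might look for the identifier DFA above.
The state names mirror the diagram; everything else (the class name, the
decision to return the length of the match) is an assumption made for the
example.

    public class IdentifierDfa {
        private enum State { START, END, REJECT }

        // Returns the length of the identifier at the start of the input,
        // or 0 if the input does not begin with an identifier.
        public static int match(String input) {
            State state = State.START;
            int matched = 0;
            for (char c : input.toCharArray()) {
                switch (state) {
                    case START:
                        state = isLetter(c) ? State.END : State.REJECT;
                        break;
                    case END:
                        state = (isLetter(c) || isDigit(c)) ? State.END : State.REJECT;
                        break;
                    default:
                        break;
                }
                if (state == State.REJECT) break;  // stop at the first character that does not fit
                matched++;      // still in END, the accepting state, so this character belongs
            }
            return matched;
        }

        private static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }
        private static boolean isDigit(char c)  { return c >= '0' && c <= '9'; }
    }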
Issues in writing a lexical analyzer
Big numbers:
When recognising large numbers in the input, there is a chance that a number
could be larger than the computer can store in an integer variable. The
number cannot simply be read in and then tested, because by the time it has
been read in it has already overflowed. For this reason, numbers should be
stored as Strings by the lexical analyzer; whether the value fits can then
be checked later on in the compilation process.
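A sketch of this, continuing with the Java used above (the class and method
names are again invented for the example):

    import java.io.IOException;
    import java.io.PushbackReader;

    public class NumberScanner {
        // Collects the digits of a number into a String instead of converting
        // them to an int, so the lexer itself can never overflow.
        public static String scanNumber(PushbackReader in) throws IOException {
            StringBuilder digits = new StringBuilder();
            int c = in.read();
            while (c >= '0' && c <= '9') {
                digits.append((char) c);
                c = in.read();
            }
            if (c != -1) in.unread(c);   // put back the character after the number
            return digits.toString();    // e.g. "99999999999999999999"
        }
    }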
Errors in input:
If an illegal character or sequence of characters is found, the lexical
analyzer should report an error to the user. It should also try to continue
checking the rest of the program for further errors, which means that it
must find a way to re-synchronise itself with the input code. One way of
doing this is to skip input until a character which can legally start a
token is found.
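One possible recovery routine is sketched below; the set of characters that
can start a token is an assumption for the example and would depend on the
language being compiled.

    import java.io.IOException;
    import java.io.PushbackReader;

    public class ErrorRecovery {
        // Skip input until a character that can legally start a token is found,
        // then resume normal scanning from that character.
        public static void recover(PushbackReader in) throws IOException {
            int c = in.read();
            while (c != -1 && !canStartToken((char) c)) {
                c = in.read();           // discard characters that cannot begin any token
            }
            if (c != -1) in.unread(c);
        }

        private static boolean canStartToken(char c) {
            return Character.isLetterOrDigit(c) || "=+-*/();{}".indexOf(c) >= 0;
        }
    }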
Languages where keywords are not reserved:
If keywords such as 'else' are not reserved in the programming language, the
lexical analyzer has no way of telling whether the string "else" is a
keyword or an identifier. This means that the compiler must check the
context in which the string appears, which is normally the task of the
syntax analyzer - where this extra checking goes is up to the individual
programmer (either way it is going to be messy; this is also an issue for
the design of programming languages, and shows why those who know how to
write compilers should be consulted when a language is being defined).