(There may or may not be writeups above which attempt to discuss this data structure. Obviously, I wasn't satisfied with them, and decided to present my own.)
NIST1 gives the following defintion for a trie:
A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
Very succinctly put, but perhaps this leaves (pardon the pun) you a little cold. It's also more limiting than it needs to be.
A trie (also known as a "radix tree") is a type of search tree; that is, it maps values of a "key" type (in this case, strings) to values of another type, and it has a tree structure which increases the speed of searches at the cost of increased memory usage. Other types of search trees include btrees (and related structures like ISAM trees) and quadtrees, and the red-black tree structure which is the most common implementation of the C++ standard library collection std::map.
Tries are different from other search trees, in that the "key" type must be a string type of some sort. We mean a "string type" in a much more general sense than "character string" types such as char[] and std::string (although those are the most common key types used). In order to be considered a "string" type, values are always ordered sequence of some smaller "atom" type. In addition, the sequences are usually allowed to reach an arbitrary length. Thus, any sequence type (such as C++'s std::vector, std::list or std::deque) may be considered a "string".
Other search trees can use strings for their key types; however, tries exploit the fact of having strings for keys to radically (cough) speed up searches to O(1) complexity, more specifically O (keylength). They do this by unstringing
the elements of the key sequence and storing later elements further down in the trie. Each node of a trie represents a particular key string; all of the node's child nodes represent longer key strings, all of which contain with the parent node's key string as an initial section (or "prefix"). For example, consider the words
Aardvark
Armadillo
Armero
Arquebus
Astronaut
Bagel
Bakelite
Baker
Bastard
Bastion
Camembert
One logical "trie" structure for these words would look like this:
*
|
ABC
||`--------------------------------------------------.
|`----------------------------------. |
| | |
ARS A (AMEMBERT)
||`----------------------. |
|`--------. | |
| | | |
(RDVARK) MQ (TRONAUT) GKS
|`--------. ||`-----------.
| | |`---. |
| | | | |
AE (UEBUS) (EL) (ELITE) T
|`------. |
| | |
(DILLO) (RO) AI
|`----.
| |
(RD) (ON)
This particular trie would store the unique terminal sections of strings in special leaf nodes (represented by strings in parentheses), but that's just a detail I added to make the diagram more readable. The important thing to notice is that you read a trie's key strings "down" the tree, as opposed to a btree, which would store entire key values in each node.
So, what to we give up, in order to achieve faster searches? The tree's branching factor is much higher, resulting in a higher memory or disk space cost. In a simple trie implementation, each node would have as many branches as there are possible values of the atom type! In addition, tries are much deeper than btrees, and can be highly unbalanced, resulting in inconsistent search times. This means we should limit the use of tries to applications where
- The number of values of the atom type is low (e.g. characters, or perhaps just A-Z)
- The actual keys stored in the tree will be very dense in the logical key value space.
This pretty much restricts the use of tries to dictionaries and word lists.
Although the method of storing keys is the trie's key (ahem) feature, a trie is an associative array, and we should spend a little time discussing the value-type side of the array. Since each node represents one possible string in the key space, each node can associate with at most one possible value. This is probably where NIST's definition is too restrictive: In a tree that stores only strings, the "value type" is boolean (in the same manner that a C++ std::set<T> is like a std::map<T, bool>). But we can associate each string with any type we want.
A VERY simplistic trie class implementation follows, with most of the details and algorithms left to your imagination:
//
// Specialize several versions of TrieNode
// to take advantage of particular types'
// values meaning "no, nothing here"
//
template<class V> struct TrieNode;
template<> struct TrieNode<bool>;
template<> struct TrieNode<int>;
// similarly, for long int, etc.
template<class V> struct TrieNode
{
V *terminal;
TrieNode<V> *branch[256];
bool isvalue () { return (terminal != 0); }
V const &value() const { return *terminal; }
V &value () { if (!terminal) terminal = new V; return *terminal; }
};
template<> struct TrieNode<bool>
{
bool terminal;
TrieNode<bool> *branch[256];
bool isvalue () { return terminal; }
bool value() const { return terminal; }
bool &value() { return terminal; }
};
template<> struct TrieNode<int>
{
int terminal;
TrieNode<int> *branch[256];
bool isvalue () { return (terminal!=0); }
int value() const { return terminal; }
int &value() { return terminal; }
};
//
// The actual trie
// Notice my pathetic attempt to make it look
// like a standard C++ Container
//
template<class V>
class Trie
{
typedef TrieNode<V> *iterator;
typedef TrieNode<V> const *const_iterator;
typedef char const *key_type;
typedef V value_type;
TrieNode<V> *root;
const_iterator seek (key_type key) const;
//
// Non-const seek must "create nodes as it goes"
//
iterator seek (key_type key);
V const &operator[](key_type key) const
{
static V dummy();
const_iterator it = seek (key);
if (!it) return dummy;
}
V &operator[](key_type key)
{
return seek (key) -> value();
}
};
1National Institute of Standards and Technology: Dictionary of Algorithms and Data Structures
http://www.nist.gov/dads/HTML/trie.html