In data structure terminology, a trie is a 26-ary tree useful for the storage and retrieval of alphabetic strings. Each node in the trie contains a list of 26 node pointers, corresponding to letters of the alphabet, as well as any data field(s) pertinent to the strings being stored.

To add a string to the trie, one starts at the root node, and follows the node pointers corresponding to each letter of the string in turn, adding new nodes as needed. When the string has been consumed, the current node will represent this particular string, so any data the programmer wishes to associate with it should be placed in this node. Or, if the trie is just meant to store strings (for instance, if it holds a word processor's spell checking dictionary), the data of the last node should merely indicate the end of a string. Thus, it can be seen that most nodes are merely stepping-stones to the end of a string or strings.

Similarly, to locate the data associated with a string (or to see if the trie contains the string at all), start at the root and follow the branch dictated by each character of the string in turn. The data of the last node arrived at (if any) is what you're looking for.

Here's an example implementation of a generic trie in C:

typedef struct trie_node {
	void *data;
	struct trie_node *branch[26];
} trie_node;

trie_node *new_trie_node( void )
{
	trie_node *new;
	int i;
	new = (trie_node*)malloc(sizeof(trie_node));
	new->data = NULL;
	for ( i = 0; i < 26; i++ )
		new->branch[i] = NULL;
}

void trie_add( trie_node *node, char *s, void *data )
/* add string s to the trie and attach data to it */
{
	if ( !*s ) node->data = data;
	else {
		int idx = *s - 'a';
		if ( !node->branch[idx] )
			node->branch[idx] = new_trie_node();
		trie_add(node->branch[idx], s+1, data);
	}
}

void *trie_get( trie_node *node, char *s )
/* return data associated with string s, or NULL if not found */
{
	if ( !node ) return NULL;
	if ( !*s ) return node->data;
	return trie_get(node->branch[*s - 'a'], s+1);
}

Note in trie_get that although the second base case might be reached and satisfied, this does not necessarily mean "success," because the string being sought could be exhausted before we reach a node with any data. This would indicate a partial match; in other words, the string being sought was a prefix of one or more strings contained in the trie, but not an actual entry. One neat side-effect of the trie's structure is that it lends itself to string completion (getting a list of possible matches for a given prefix.)

Initially, one might be tempted to think that each node should keep a record of what character it represents, but this is unnecessary; the structure of the trie itself dictates what each node represents. Or, put differently, how you got to this node defines the string so far. Any path from the root node to a node with non-empty data delimits a string.

Complexity of the add and get operations is O(n), where n is the length of the string in question.

Although a trie is most commonly useful for dictionaries of words, you may desire a larger set of characters than just 26 letters. In such a case, expanding the code above to include arrays of size 256 (to encompass the entire extended ASCII table, for example) would be impractical, for memory usage will go through the roof--even for relatively small data sets. Converting that to a linked list would help minimize memory usage but could significanly increase search times.

As an alternative to using an array or a linked list to store each token in a given trie entry, some form of tree can be used instead (hint: an avl tree or a binary tree). Using an avl tree, for example, will reduce search times for a given token of an entry from O(n) (using a linked list) to O(log n) as well as minimize memory usage over that of the array approach. As an additional bonus, dictionaries can be constructed from entries whose elements are of variable size. Such a construct might be useful in something akin to a compression algorithm's dictionary or some obscure indexing algorithm that does weird lookups. At the slight expense of creation (and possibly deletion) overhead, this type of construct could provide a great balance between speed, memory usage, and element size--specifically with particularly large data sets.

(There may or may not be writeups above which attempt to discuss this data structure. Obviously, I wasn't satisfied with them, and decided to present my own.)

NIST1 gives the following defintion for a trie:

A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

Very succinctly put, but perhaps this leaves (pardon the pun) you a little cold. It's also more limiting than it needs to be.

A trie (also known as a "radix tree") is a type of search tree; that is, it maps values of a "key" type (in this case, strings) to values of another type, and it has a tree structure which increases the speed of searches at the cost of increased memory usage. Other types of search trees include btrees (and related structures like ISAM trees) and quadtrees, and the red-black tree structure which is the most common implementation of the C++ standard library collection std::map.

Tries are different from other search trees, in that the "key" type must be a string type of some sort. We mean a "string type" in a much more general sense than "character string" types such as char[] and std::string (although those are the most common key types used). In order to be considered a "string" type, values are always ordered sequence of some smaller "atom" type. In addition, the sequences are usually allowed to reach an arbitrary length. Thus, any sequence type (such as C++'s std::vector, std::list or std::deque) may be considered a "string".

Other search trees can use strings for their key types; however, tries exploit the fact of having strings for keys to radically (cough) speed up searches to O(1) complexity, more specifically O (keylength). They do this by unstringing the elements of the key sequence and storing later elements further down in the trie. Each node of a trie represents a particular key string; all of the node's child nodes represent longer key strings, all of which contain with the parent node's key string as an initial section (or "prefix"). For example, consider the words

Aardvark
Armadillo
Armero
Arquebus
Astronaut
Bagel
Bakelite
Baker
Bastard
Bastion
Camembert

One logical "trie" structure for these words would look like this:

         *
         |
         ABC
         ||`--------------------------------------------------.
         |`----------------------------------.                |
         |                                   |                |
         ARS                                 A                (AMEMBERT)
         ||`----------------------.          |
         |`--------.              |          |
         |         |              |          |
         (RDVARK)  MQ             (TRONAUT)  GKS
                   |`--------.               ||`-----------. 
                   |         |               |`---.        | 
                   |         |               |    |        |
                   AE        (UEBUS)         (EL) (ELITE)  T
                   |`------.                               |
                   |       |                               |
                   (DILLO) (RO)                            AI
                                                           |`----.
                                                           |     |
                                                           (RD)  (ON)

This particular trie would store the unique terminal sections of strings in special leaf nodes (represented by strings in parentheses), but that's just a detail I added to make the diagram more readable. The important thing to notice is that you read a trie's key strings "down" the tree, as opposed to a btree, which would store entire key values in each node.

So, what to we give up, in order to achieve faster searches? The tree's branching factor is much higher, resulting in a higher memory or disk space cost. In a simple trie implementation, each node would have as many branches as there are possible values of the atom type! In addition, tries are much deeper than btrees, and can be highly unbalanced, resulting in inconsistent search times. This means we should limit the use of tries to applications where

  • The number of values of the atom type is low (e.g. characters, or perhaps just A-Z)
  • The actual keys stored in the tree will be very dense in the logical key value space.

This pretty much restricts the use of tries to dictionaries and word lists.

Although the method of storing keys is the trie's key (ahem) feature, a trie is an associative array, and we should spend a little time discussing the value-type side of the array. Since each node represents one possible string in the key space, each node can associate with at most one possible value. This is probably where NIST's definition is too restrictive: In a tree that stores only strings, the "value type" is boolean (in the same manner that a C++ std::set<T> is like a std::map<T, bool>). But we can associate each string with any type we want.

A VERY simplistic trie class implementation follows, with most of the details and algorithms left to your imagination:


//
// Specialize several versions of TrieNode
// to take advantage of particular types'
// values meaning "no, nothing here"
//

template<class V> struct TrieNode;
template<> struct TrieNode<bool>;
template<> struct TrieNode<int>;
// similarly, for long int, etc.

template<class V> struct TrieNode
{
 V *terminal;
 TrieNode<V> *branch[256];
 bool isvalue () { return (terminal != 0); }
 V const &value() const { return *terminal; }
 V &value () { if (!terminal) terminal = new V; return *terminal; }
};


template<> struct TrieNode<bool>
{
 bool terminal;
 TrieNode<bool> *branch[256];
 bool isvalue () { return terminal; }
 bool value() const { return terminal; }
 bool &value() { return terminal; }
};


template<> struct TrieNode<int>
{
 int terminal;
 TrieNode<int> *branch[256];
 bool isvalue () { return (terminal!=0); }
 int value() const { return terminal; }
 int &value() { return terminal; }
};

//
// The actual trie
// Notice my pathetic attempt to make it look
// like a standard C++ Container
//

template<class V> 
class Trie
{
 typedef TrieNode<V> *iterator;
 typedef TrieNode<V> const *const_iterator;
 typedef char const *key_type;
 typedef V value_type;
 TrieNode<V> *root;
 const_iterator seek (key_type key) const;
 //
 // Non-const seek must "create nodes as it goes"
 //
 iterator seek (key_type key); 
 V const &operator[](key_type key) const
 {
  static V dummy();
  const_iterator it = seek (key);
  if (!it) return dummy;
 }
 V &operator[](key_type key)
 {
  return seek (key) -> value();
 }
};

1National Institute of Standards and Technology: Dictionary of Algorithms and Data Structures
http://www.nist.gov/dads/HTML/trie.html

Log in or registerto write something here or to contact authors.