skip list - Everything2.com

Skip lists are a randomized variant of linked lists, designed by Bill Pugh as an improvement over or replacement for balanced trees. The goal of this data structure is to have search times, insertion times, and deletion times all growing with O(log n), where n is the number of elements in the list.

As described elsewhere (linked lists,binary search tree,BST disease, etc), simple binary search trees have the problem that they can become unbalanced, depending on insertion and deletion order. This causes the growth rate of search times to increase (to the worst case, of checking every node, as in a linked list). Variations on these trees have been designed to limit or combat the degradation. However, these variations suffer from increased complexity, increased cost of certain operations (deletion, most of the time), and other problems.

Enter the skip list. An example of a skip list with a few nodes inserted already might be pictured like this:

 ---                             ---
|   |-------------------------->|   |
 ---             ---             ---
|   |---------->|   |---------->|   |
 ---             ---             ---
|   |---------->|   |---------->|   |
 ---     ---     ---     ---     ---
| 1 |-->| 3 |-->| 4 |-->| 7 |-->| 9 |
 ---     ---     ---     ---     ---

The bottom layer looks like - and is - a simple linked list. The layers above it also connect nodes in order (in this case, incresing numeric value), but do not connect all of the nodes.

To search this list for a particular element, you would start at the highest layer of the first node. Assume you're search for seven (7). You would follow the link once, and compare the value of it with the value you are seeking. In this case, the value you find is nine (9). As that is larger than seven, go back to the link you just traversed (from 1 to 9), and drop down a layer. Again follow a link to the right, and compare its value. As it is four, which is less than seven, you make that your new start point, and continue along the list. Follow the link at the same layer you followed to get to four, and compare the value. As it is bigger, we drop layer again, follow again, etc. Finally, we reach seven. If we were searching for six instead of seven, we would have reached seven, at the bottom layer without finding six; This implies intuitively that it will take the same amount of time to determine if the element is or is not in the list.

In this example, searching seems to be less efficient than simply checking each element in turn, one by one. However, envision the optimum list of this type with, say, a thousand nodes. There would be a node with all possible layers half way between the endnodes. At each quarter marker, a node would have all but the highest layer; at each eighth, all but the highest two, and so forth. Search would be almost exactly proportional to the log of the number of nodes, in the worst case (roughly 10), whereas searching each element as in a linked list would take an average of half the nodecount, and in the worst case all nodes would need to be checked.

Deletion from this linked list takes as long as a search, plus some constant (that is, O(log n)). Deletion involves finding the element you wish to delete, and then pointing all links which point at it towards the nodes which it points at on that layer. Looking at the example above, if we deleted 4, the resulting list would look like this:

 ---                             ---
|   |-------------------------->|   |
 ---                             ---
|   |-------------------------->|   |
 ---                             ---
|   |-------------------------->|   |
 ---     ---             ---     ---
| 1 |-->| 3 |---------->| 7 |-->| 9 |
 ---     ---             ---     ---

Simple, really, especially compared to deleting a node in a binary search tree. Edge cases, such as deleting the first or last element can be dealt with by having a header and/or a tail node that holds no value, pointing the last link on a layer at NULL, or using other, similar tricks.

Insertion is where the randomized nature of the data structure is. Decide how many layers a new node you want to insert will have, 'randomly'. One way of doing this is a coin-flipping method. Each new node will have a link in the bottom layer. Flip a coin. If it is heads (true/1/on), it will have a link in the next layer as well. If it is tails, stop. Over a large number of nodes, half of the nodes will have at least two links from them, a quarter will have at least three, and so on up. Insertion order does not affect things, as the number of links in each node is not dependent on what other nodes are in the list.

The key to the structure - this will be hard to swallow - is that once a large number of nodes have been entered into the list, it will approach the optimum case in terms of search, insert, and delete times and growth rates of the same in relation to list size. I am, myself, unable to prove it, though others have.

Looking at the first example diagram, an insertion might happen as follows - the node you wish to insert is value six; you flip that imaginary coin twice: the first time it comes up heads, so you'll put a second layer on the node, and the second throw it comes up tails. Onward.

You'd find the links to insert it much the same way you'd search for any other element, but you'd keep track of the last node not greater than the element at each layer. Once you've reached the bottom layer, insert the element - point the new node, at each layer it has, at whatever the last node not greater than it is currently linked to. Then point the last node not greater at the new node, for each layer the new node has. Quick and easy. Inserting six, with two layers, would end with the list looking like this:

 ---                                     ---
|   |---------------------------------->|   |
 ---             ---                     ---
|   |---------->|   |------------------>|   |
 ---             ---     ---             ---
|   |---------->|   |-->|   |---------->|   |
 ---     ---     ---     ---     ---     ---
| 1 |-->| 3 |-->| 4 |-->| 6 |-->| 7 |-->| 9 |
 ---     ---     ---     ---     ---     ---

Skip lists are just one example of how randomness - and probability - can simplify a problem and improve the result.

Most of this was pulled out of my head from when I coded a couple of them, though the essential nature of it is unchanged. I'm sure there are variations between what I described here and in Pugh's original paper on the subject. Anything that is too out of line, let me know and I'll fix.

randomized algorithm	BST disease	How to flip a coin when you haven't got one	Random Binary Search Tree
hash table	cross-linked list	Red Black Tree	empty words
Big O	Rock Action	splay tree	SAS
binary search tree	binary tree	data structure	Big-oh notation
linked list	null