The Shift-And method, developed by Baeza-Yates and Gonnet, is one of many solutions to the exact string matching problem: given a string P called the pattern and a longer string T called the text, the exact string matching problem is to find all occurrences, if any, of pattern P in text T. This one is of partucluar interest because it uses bit arithmetic and operations rather than actual comparisons of characters in the pattern and the text.
We start off with these basic definitions and rules:
The pattern string P has size n.
The text string T has size m.
M is a binary valued array of size n by m+1; binary simply means each element in the array is either 1 or 0.
Entry M(i,j) is 1 if and only if the first i characters of P exactly match the i characters of T ending at character j. Otherwise, the entry is 0.
In other words, M(i,j) is 1 if and only if P[1..i] exactly matches T[j-i+1..j]. For example, if T = jabberwocky and P = erw, then M(1,5) = M(2,6) = M(3,7) = 1, and all other possible M(i,j)'s are equal to 0. Here's another example with the entire M array:
T = michiganmilitia
P = mi
M i--> | 1 | 2 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 0 |
4 | 0 | 0 |
5 | 0 | 0 |
6 | 0 | 0 |
7 | 0 | 0 |
8 | 1 | 0 |
9 | 0 | 1 |
10 | 0 | 0 |
11 | 0 | 0 |
12 | 0 | 0 |
13 | 0 | 0 |
14 | 0 | 0 |
One can easily see a pattern when comparing the first column to the second one; this pattern is very important in the upcoming solution to the problem. We can also note that where we find a 1 in the second column, we find the ending of an exact match between pattern and text. First, however, we need something else to aid in this calculation...
U(x) is an n-length binary vector for each letter in the alphabet of the pattern; it marks with a 1 each place in the pattern where the letter appears, and all other places in the pattern are marked with a 0.
Let's see an example of this so that the value of U(x) is clear.
P = michiganmilitia
U(i) = 010010000101010
We need one more function to pull off the Shift-And method, and here it is.
BitShift(j) is the vector derived from shifting the vector for column j (in the array M, remember) down by one position and setting the first bit to 1. The previous bit at position n disappears. In other words, BitShift(j) consists of 1 followed by the first n-1 bits of column j-1.
Here's an example of how BitShift works.
Now we have all we need to fill up the array M very quickly. The first column of M is all 0's, or, in the case that the first character of the pattern and the text match, a 1 followed by 0's. Once we have this first column, each successive column is calculated as follows:
Column M(k), being the kth column of M, is equal to the following binary expression:
M(k) = U(T(k)) AND BitShift(k-1)
This basically means that M(i,j) is only 1 if both U(T(i)) is 1 AND BitShift(i-1) is 1 as well.
Every instance of 1 in the final column of M is the ending to an exact match.
The AND operation of comparing word-sized (usually 8 bits) strings, on most computers, is extremely fast. So, if we have a pattern string with a number of characters equal to or fewer than the number of bits in a word on a particular machine, the AND comparisons can go extremely quickly, resulting in lightning-fast string matching.
This method is often used when comparing very small patterns against larger texts; a common place to find it is in the unix utility agrep, among other places.