Introduction

For those of you without a copy of Knuth's The Art of Computer Programming under their pillows, a brief introduction to the binary search algorithm:

Binary searching is generally the most effective way to find a given item of data in a collection of data
  • if that data is sorted on the key you're looking for,
  • if the collection provides random access to the data and
  • if the data itself is all you have to work with, i.e. there is no index or similar search-supporting data structure available to help.
In brief, binary searching involves
  1. starting in the middle of your data
  2. deciding if what you're looking for is in the first half (before the middle) or the second half (after the middle)
  3. repeating from step 1 starting with the half that your key is in.

This technique halves your search space with every iteration: In Computerese, you'd say that this is an O(log2n) search, meaning that search time increases with the base 2 logarithm of the amount of data. In other words, searching through a million data items will take you only twice as long as searching through a thousand.

The problem

I was recently faced with a task where my program was given a plain text file full of flight data for a number of airlines, and I had to give the user a list of airlines to choose from so he could control which one(s) would be processed. This flight data consisted mostly of lines which started with the airline code; all the flights of an airline were together one after another, then the next airline's data would start. Here's a small mock sample:

AC  027  07APR02 20OCT02       7 YYZ 2225 HNL 0810
AC  028  08APR02 26AUG02 1       HNL 1130 YYZ 2025
AC  115  18NOV01 31MAR02       7 YYZ 2230 YVR 0331
AF 1139  01DEC01 22FEB02 1234567 VIE 0910 CDG 1115
AT  202  16FEB02 21FEB02  2 4 6  CMN 1230 JFK 2100
AT  203  16FEB02 21FEB02  2 4 6  JFK 2230 CMN 0545

What makes this problem interesting is the sheer volume of data. This program could be used on the data that goes into making a semi-annual flight catalog for travel bureaus and other flight vendors. We're talking maybe 100 airlines here, in 300,000 lines and 80 Megabytes. Also, my program doesn't get a chance to prepare for the onslaught: The user finds it a file to work on, clicks on it and expects to see the list of available airlines before he's had a chance to lean back in his chair. Another interesting kink is that the data in the file is not necessarily in order of ascending airline codes.

In this day and age of databases, many programmers would take one of these no-brainer approaches:

  • Read through the file from top to bottom, keeping track of all airlines in a list; or
  • Copy the data into a database, index the database and then extract the airlines as unique keys.

The first approach is unsatisfactory because it's too slow. It simply takes too long to read 300,000 lines of data. The second approach is even slower and would have required me to sell a database program with my own product.

I built a mockup of my program using method 1, just to get a feel for the timing. Sequential processing took 44 seconds, much too long for customers spoiled by Instant Everything.

The solution

I thank Professor Heller for teaching me the basics in Computer Science 101, back in 1979. One of the techniques I learned and have often put to good use is, of course, the binary search.

At this point you may object that

  • I don't know what I'm searching for, and
  • The airline codes are not necessarily sorted.

This is true but both issues turned out to be amenable to creative thinking. With "interesting" problems like this, it is vital to formulate very clearly what the goal is. I wish to find all airlines without reading every single line. Since all the data within one airline is together, my only worry is missing an entire airline. To avoid missing an airline, what I was searching for was the break between a given airline and the next. Now I had something to search for!

The approach

  1. Read the first line to discover the first airline. Add it to the empty list. Call the current line's position P1.
  2. Read the last line to discover the last airline. Call the current line position P9 (it's a long way from P1). If the airline is the same as the first airline, we're done.
  3. Read the line midway between P1 and P9. Call the current line position P5 (it's midway between P1 and P9). If the airline at P5 is different from the one at P1, then the break is between P1 and P5. So set P9 to P5 and go back to Step 3.
  4. The airline at P5 is the same as the one at P1, so the break must be between P5 and P9. Set P1 to P5 and go back to Step 3.
  5. After a few repeats, P1 and P9 will be right next to each other; our "current" airline will be at P1 and the next airline will be at P9. Because they're successive lines, we know there's no other airline in between. So add the airline at P9 to our list, set P1 to P9 and go back to Step 2.
  6. Sooner or later, P1 will be right up next to the last line. We're done!

Comments

I have left out a number of details for simplicity and brevity. One of these concerns my treatment of lines that aren't all the same length, and the related problem of coming up with a P5 in the middle of a line rather than at the beginning. Also, the "original" algorithm states that P1 should be set to P5 + 1 or P9 to P5 - 1. This is an algorithm that needs to be seasoned to taste.

The running time for the final version of my program was under 3 seconds - mission accomplished!