Introduction

For those of you without a copy of Knuth's The Art of Computer Programming under their pillows, a brief introduction to the binary search algorithm:

Binary searching is generally the most effective way to find a given item of data in a collection of data
  • if that data is sorted on the key you're looking for,
  • if the collection provides random access to the data and
  • if the data itself is all you have to work with, i.e. there is no index or similar search-supporting data structure available to help.
In brief, binary searching involves
  1. starting in the middle of your data
  2. deciding if what you're looking for is in the first half (before the middle) or the second half (after the middle)
  3. repeating from step 1 starting with the half that your key is in.

This technique halves your search space with every iteration: In Computerese, you'd say that this is an O(log2n) search, meaning that search time increases with the base 2 logarithm of the amount of data. In other words, searching through a million data items will take you only twice as long as searching through a thousand.

The problem

I was recently faced with a task where my program was given a plain text file full of flight data for a number of airlines, and I had to give the user a list of airlines to choose from so he could control which one(s) would be processed. This flight data consisted mostly of lines which started with the airline code; all the flights of an airline were together one after another, then the next airline's data would start. Here's a small mock sample:

AC  027  07APR02 20OCT02       7 YYZ 2225 HNL 0810
AC  028  08APR02 26AUG02 1       HNL 1130 YYZ 2025
AC  115  18NOV01 31MAR02       7 YYZ 2230 YVR 0331
AF 1139  01DEC01 22FEB02 1234567 VIE 0910 CDG 1115
AT  202  16FEB02 21FEB02  2 4 6  CMN 1230 JFK 2100
AT  203  16FEB02 21FEB02  2 4 6  JFK 2230 CMN 0545

What makes this problem interesting is the sheer volume of data. This program could be used on the data that goes into making a semi-annual flight catalog for travel bureaus and other flight vendors. We're talking maybe 100 airlines here, in 300,000 lines and 80 Megabytes. Also, my program doesn't get a chance to prepare for the onslaught: The user finds it a file to work on, clicks on it and expects to see the list of available airlines before he's had a chance to lean back in his chair. Another interesting kink is that the data in the file is not necessarily in order of ascending airline codes.

In this day and age of databases, many programmers would take one of these no-brainer approaches:

  • Read through the file from top to bottom, keeping track of all airlines in a list; or
  • Copy the data into a database, index the database and then extract the airlines as unique keys.

The first approach is unsatisfactory because it's too slow. It simply takes too long to read 300,000 lines of data. The second approach is even slower and would have required me to sell a database program with my own product.

I built a mockup of my program using method 1, just to get a feel for the timing. Sequential processing took 44 seconds, much too long for customers spoiled by Instant Everything.

The solution

I thank Professor Heller for teaching me the basics in Computer Science 101, back in 1979. One of the techniques I learned and have often put to good use is, of course, the binary search.

At this point you may object that

  • I don't know what I'm searching for, and
  • The airline codes are not necessarily sorted.

This is true but both issues turned out to be amenable to creative thinking. With "interesting" problems like this, it is vital to formulate very clearly what the goal is. I wish to find all airlines without reading every single line. Since all the data within one airline is together, my only worry is missing an entire airline. To avoid missing an airline, what I was searching for was the break between a given airline and the next. Now I had something to search for!

The approach

  1. Read the first line to discover the first airline. Add it to the empty list. Call the current line's position P1.
  2. Read the last line to discover the last airline. Call the current line position P9 (it's a long way from P1). If the airline is the same as the first airline, we're done.
  3. Read the line midway between P1 and P9. Call the current line position P5 (it's midway between P1 and P9). If the airline at P5 is different from the one at P1, then the break is between P1 and P5. So set P9 to P5 and go back to Step 3.
  4. The airline at P5 is the same as the one at P1, so the break must be between P5 and P9. Set P1 to P5 and go back to Step 3.
  5. After a few repeats, P1 and P9 will be right next to each other; our "current" airline will be at P1 and the next airline will be at P9. Because they're successive lines, we know there's no other airline in between. So add the airline at P9 to our list, set P1 to P9 and go back to Step 2.
  6. Sooner or later, P1 will be right up next to the last line. We're done!

Comments

I have left out a number of details for simplicity and brevity. One of these concerns my treatment of lines that aren't all the same length, and the related problem of coming up with a P5 in the middle of a line rather than at the beginning. Also, the "original" algorithm states that P1 should be set to P5 + 1 or P9 to P5 - 1. This is an algorithm that needs to be seasoned to taste.

The running time for the final version of my program was under 3 seconds - mission accomplished!

Say we have a bounded infinite set S on the real line. That is to say, S contains infinitely many points, but all those points are contained within a bounded interval; go out far enough on the real line in either direction and eventually you'll reach a location beyond which there are no more points in S. As it turns out, there has to be a point on the real line such that any open interval containing that point contains infinitely many points in S. Why? Read on.

If you ever take a course in real analysis (please note that 10998521's writeup on this is absolutely wham-bang boffo fantastic), the probability is as high as 75% that your professor will at some point ask you, with a wry smile, "Do you know how to catch a lion?"

In all likelihood, at this point in the course you will be thoroughly tired of being made to put up with such ridiculous shit as proving the (patently obvious, right? Right?) Intermediate Value Theorem and fiddling about endlessly with deltas and epsilons, and you will cringe at the very thought of what horrific and complex concept might be insidiously concealed within the innocuous procedure of catching a lion. You might prefer going to Africa, armed with nothing but a spear, a loincloth, and your wits, to whatever it is your professor is about to subject you to.

Fear not, because not only is the procedure for lion-catching intimately familiar to any computer science major (at least, it should be), it might just give you as much bang (in terms of usefulness in problem solving) for your buck (in terms of difficulty in understanding) as any other concept in real analysis.

To catch a lion in a field, or a desert, or what have you (let's call it a region, just to keep things general), divide the region in half. Pick the half with the lion in it and divide that half in half. Lather. Rinse. Repeat. This process is described in this node (in section 1.4, "The Bolzano-Weierstrass method"), along with several other ha-ha that's funny science-inspired methods of leonine appropriation. This here, however, is serious business.

This is a binary search, but in the context of a continuous rather than a discrete search space. In a discrete search space of size N, you can find the lion, on average, in log2 N steps. In a continuous search space... wellllll.

  • Long story shortest: Keep doing this and you keep narrowing in on the location of the lion. You'll never find it exactly, but you can get as close as you want.1

  • Slightly longer story: After the nth iteration of this search, your region is (1/2n)th of its original size. It's pretty well-known that the sequence given by Sn = (1/2n) converges to 0; the only region with area of 0 is a point. That point is where the lion is. Better nuke it from orbit; it's the only way to be sure.

  • Longest story I'm willing to tell: The fact that there is a lion to find, so to speak, illustrates the Nested Interval Property, one of many equivalent ways of expressing the most fundamental fact about the real numbers. Put simply, this property is that the real line has no holes in it, as opposed to the rational number "line" which has holes at irrational numbers like π and e.

    The Nested Interval Property states that given a sequence of intervals each of which contains the next, such that the size of the intervals gets arbitrarily small (i.e. "as small as we want") if we go out far enough in the sequence, there is a point which is in all the intervals. This property does not hold for the rational numbers or any smaller set. Because the intervals we have here get their sizes halved every time we iterate, we can get them as small as we want, and so the Nested Interval Property applies here. The point it describes is where the lion is.

Commend yourself if you're not asking what the point is by now, no pun intended. To be perfectly honest, the Nested Interval Property was the big idea here. If we look at the example from the beginning of this writeup, however, we see how the Nested Interval Principle makes it trivial.

Trivial how? Well, take the interval containing that set S. It contains infinitely many points of S. Divide it in half. At least one of those halves contains infinitely many points of S; if they both contained only finitely many points, S itself would have only finitely many points; a contradiction. Take the half containing infinitely many points of S and divide it in half. If both contain infinitely many points in S, just pick one.

You know the rest. Thanks to the binary search algorithm, you know how to catch a lion. The lion is the point in "all the halves," and the fact that all intervals containing it have an infinite number of points from S is a direct consequence of the method we used. What we just informally proved is noded somewhat more technically under Every infinite subset of (0,1) has a limit point. At the risk of self-aggrandizement I have to say I prefer the approach described here.


1 "Getting as close as you want" is an important concept in real analysis. What analysts mean when they say something like this (which they won't, because it's so painfully imprecise, but if they did) is "pick any small value (usually ε, epsilon, is used to represent this value); you can get even closer than that." This is one of the contrived-sounding but essential hacks that have been devised to do things like avoid infinity and excise ambiguity from calculus and its ilk.

Log in or register to write something here or to contact authors.