The law governing the distribution of the first (most significant, leftmost) digit of numbers arising from an unbounded distribution (i.e. it works for lengths of rivers, but not for phone numbers). Contrary to initial expectations, the leftmost digit is a `1' in more than 30% of such numbers, and a `9' in less than 4.6%! The actual fraction of numbers starting with digit d is log10(d+1) - log10(d) (note that no numbers start with a `0'). For other bases, change the base of the logarithm; bases like 1000 are relevant to the decimal case, too.

What's going on? Well, suppose a distribution actually exists for numbers like the lengths of rivers. Clearly it can't depend on the units used to measure the length, so multiplying every number by a constant can't change the distribution. The only distribution invariant under such rescaling is one that is uniform in the logarithm of the number, and a log-uniform distribution gives exactly the first-digit fractions above.

If you know any probability, you realise that the above paragraph is meaningless! There is no uniform distribution on an unbounded range, which is what the argument seems to require. And you cannot prove anything about the distribution if it doesn't exist! Nonetheless, people have managed to prove some versions of Benford's Law by making "reasonable" assumptions. And empirically, it works!
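
Here is a quick way to see the log-uniform idea at work (a Python sketch of my own, not a proof): sample numbers whose base-10 logarithm is uniform over several orders of magnitude, rescale them by an arbitrary constant to mimic a change of units, and tally the first digits.

    import math
    import random
    from collections import Counter

    random.seed(1)

    def first_digit(x):
        # shift x into [1, 10) and take the integer part
        return int(x / 10 ** math.floor(math.log10(x)))

    counts = Counter()
    trials = 100_000
    for _ in range(trials):
        x = 10 ** random.uniform(0, 6)   # log-uniform over six orders of magnitude
        x *= 3.7                         # arbitrary change of units; the tally is unaffected
        counts[first_digit(x)] += 1

    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / trials:.3f}, Benford {math.log10(1 + 1/d):.3f}")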

This is the law:
1---------------2---------3-------4-----5----6---7--8--9
Or, put another way: the probability that the first digit of a number in statistical data is d equals
                  log_10(1 + 1/d)
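
A two-line check in Python reproduces the percentages quoted above:

    import math
    for d in range(1, 10):
        print(d, f"{math.log10(1 + 1/d):.1%}")   # 1: 30.1%, 2: 17.6%, ... 9: 4.6%
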
The law was first discovered by the astronomer and mathematician Simon Newcomb in 1881, when he noticed that in books of logarithmic tables, the pages with lower numbers were more worn than the pages with higher numbers. But it took until 1938 for Frank Benford to publish the results of an analysis of more than 20,000 numbers from various sources: price lists, electricity bills and street addresses.

In 1996 Professor Ted Hill suggested an explanation of this phenomenon. The law applies to large collections of numbers that are derived from many different sources, not to a single stream of random numbers. If we take several sources of random numbers, each following a normal distribution or some other distribution, and combine their outputs, the result is a distribution of distributions. That combined distribution follows Benford's law, with most numbers falling in the lower part of each digit range.
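
As a rough illustration of that idea (my own sketch; the particular distributions and parameters are arbitrary choices, not Hill's construction), you can draw each number from a randomly chosen distribution with a randomly chosen scale and watch the first digits drift toward Benford's law:

    import math
    import random
    from collections import Counter

    random.seed(7)

    def first_digit(x):
        x = abs(x)
        return int(x / 10 ** math.floor(math.log10(x)))

    counts = Counter()
    for _ in range(200_000):
        scale = 10 ** random.uniform(-2, 4)              # random scale for each draw
        kind = random.choice(("uniform", "normal", "exponential"))
        if kind == "uniform":
            x = random.uniform(0, scale)
        elif kind == "normal":
            x = random.gauss(scale, scale / 3)
        else:
            x = random.expovariate(1 / scale)
        if x != 0:
            counts[first_digit(x)] += 1

    total = sum(counts.values())
    for d in range(1, 10):
        print(f"{d}: observed {counts[d] / total:.3f}, Benford {math.log10(1 + 1/d):.3f}")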

Benford's law is commonly used by accounting firms and others who work with large sets of numbers. If the numbers haven't been tampered with, the first digit should be 1 in about 30.1% of them and 9 in only about 4.6%. If the distribution is more even than that, something is probably wrong... Of course, some numbers are by nature more common than others, such as amounts of $24, which happens to be the largest amount you can claim on an expense report in America without a receipt.
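
A simple version of such a check (an illustrative sketch only; real forensic accounting uses more careful tests, and the "fake" data below is invented) is a chi-square comparison of observed first digits against the Benford frequencies:

    import math
    from collections import Counter

    def benford_chi_square(amounts):
        digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a]
        n = len(digits)
        counts = Counter(digits)
        # sum over the nine digits of (observed - expected)^2 / expected
        return sum((counts[d] - n * math.log10(1 + 1/d)) ** 2
                   / (n * math.log10(1 + 1/d))
                   for d in range(1, 10))

    # Cooked books whose first digits are spread almost evenly:
    fake = list(range(100, 1000, 9))
    # Comes out around 38, far above the 5% critical value of ~15.5
    # for 8 degrees of freedom, so the even spread is flagged as suspicious.
    print(f"chi-square = {benford_chi_square(fake):.1f}")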

Another slightly counterintuitive fact from probability is that improbable results occur more often than you'd think. Theodore Hill used to give his students the following homework: flip a coin 200 times and write down the results. Many of the students got tired after 20 flips and just wrote down made-up results that they thought would look plausible. The thing is, in a series of 200 coin flips, it is highly probable that there will be a run of 6 heads (or tails) in a row. The students who cheated rarely had more than 3 of the same outcome in a row.
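
It's easy to check this claim by simulation (a quick sketch; the exact probability depends on how you count runs):

    import random

    random.seed(42)

    def longest_run(n_flips=200):
        flips = [random.randint(0, 1) for _ in range(n_flips)]
        best = run = 1
        for prev, cur in zip(flips, flips[1:]):
            run = run + 1 if cur == prev else 1
            best = max(best, run)
        return best

    trials = 10_000
    hits = sum(longest_run() >= 6 for _ in range(trials))
    print(f"P(run of 6+ in 200 flips) = {hits / trials:.2f}")   # comes out around 0.96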


[Graphic by Kevin Brown, http://www.seanet.com/~ksbrown/index.htm]

Benford's Law makes wonderful sense in situations of exponential growth.

Imagine you put $100.00 in a savings account that earns 10% interest a year. At the end of the first year you have $110.00, at the end of the second you'll have $121.00, and the third will leave you with $133.10. The leading digit will remain a one until the eighth year (at which point you'll have about $214.36). Two will be the leading digit for the next four years (at the end of which you'll have $313.84). Three more years will get you into the four hundreds (with $417.72), but you'll reach the five hundreds only two years after that. The more money you have in your account, the less time you'll spend with any particular leading digit. That is, until you have more than $1000.00 in your account. At that point it will again take you eight (or so) years to get to $2000.00, four to get to $3000.00, three to $4000.00, et cetera. A similar ratio from year to year will be present regardless of how high the interest rate is.
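
If you'd rather not do the arithmetic by hand, a short Python sketch of the same account confirms the pattern: the share of year-ends spent on each leading digit lands close to the Benford frequencies.

    import math
    from collections import Counter

    balance = 100.0
    years_per_digit = Counter()
    for year in range(1, 201):                       # 200 year-ends
        balance *= 1.10
        leading = int(balance / 10 ** math.floor(math.log10(balance)))
        years_per_digit[leading] += 1

    for d in range(1, 10):
        share = years_per_digit[d] / 200
        print(f"{d}: {years_per_digit[d]} year-ends ({share:.1%}), "
              f"Benford predicts {math.log10(1 + 1/d):.1%}")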

It thus makes perfect sense that a "random" sampling of savings account balances will have about twice as many ones as twos for their leading digits, since the average account spends almost twice as much time with a one as its leading digit. If we calculated the above for continuously compounded interest, the numbers would match those predicted by Benford's law even more closely.

Thus we expect things like the sizes of cities and the prices of stocks to follow Benford's law, since both also grow exponentially. What's freaky is how many unexpected things also follow the law; apparently logarithmic scales are more common in nature and society than we might think.

Benford's law is a remarkable statistical law concerning the occurrence and distribution of numbers in a body of text or set of data.

The law states that if you take a corpus of text and tally the first digit of every number in that text, the resulting distribution is heavily skewed towards the low digits rather than uniform.

More precisely: in any substantial body of text, numbers beginning with the digit 1 occur roughly twice as often as numbers beginning with the digit 2, which in turn occur more often than numbers beginning with the digit 3, and so on all the way down to numbers beginning with the digit 9.

In practice the drop in frequency from one digit to the next is less than a full half (about 40% from 1 to 2, shrinking to about 10% from 8 to 9), but the skew is still pronounced enough that numbers starting with the digit 9 occur in only around 5% of all instances.
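
Putting exact numbers on those ratios:

    import math

    p = {d: math.log10(1 + 1/d) for d in range(1, 10)}
    for d in range(1, 9):
        print(f"P({d}) / P({d + 1}) = {p[d] / p[d + 1]:.2f}")
    # falls from about 1.71 (digit 1 vs 2) to about 1.12 (digit 8 vs 9)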

At first this law seems very counterintuitive. The natural assumption is that all digits should occur with around the same frequency, giving an even distribution. The other remarkable thing about this law is that not only does it fit a random corpus of text, it also applies to most types of more formal data, even data that are logical, controlled and sequenced. This includes data such as electricity bills, stock prices, lengths of rivers and even mathematical constants.

The reason for this law is the overwhelming prevalence of exponential and recursive systems in nature. One example is population growth, which can be modelled recursively, so that the total population follows an exponential curve. If we consider bacteria, each bacterium divides into two bacteria, which in turn divide into four, and again into eight; the population keeps growing faster and faster. The reason Benford's law is so prevalent is that these kinds of systems pop up everywhere in nature, even when we don't expect them. The phenomenon was noted as far back as 1881, when the American astronomer Simon Newcomb (sometimes credited with discovering Benford's law) noticed that in books of logarithms, the earlier pages, which contained numbers starting with 1, were more worn than the other pages.

With any system where numbers grow exponentially, the way our number system works means the running total spends most of its time beginning with the digit 1 or 2. This might be hard to envision, but it becomes clear if you look at a logarithmic scale. On a logarithmic scale (of base 10) the distance between 1 and 2 is equal to the distance between 10 and 20, and noticeably larger than the distance between 2 and 3 (or 20 and 30). An exponential curve plotted on a logarithmic scale is a straight line, so the quantity spends time in each interval in proportion to that interval's logarithmic width; the interval for leading digit d has width log_10(d+1) - log_10(d), which is exactly the Benford frequency.
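
The widths themselves are easy to verify:

    import math

    print(math.log10(2) - math.log10(1))    # width of [1, 2): 0.301, the Benford P(1)
    print(math.log10(20) - math.log10(10))  # width of [10, 20): also 0.301
    print(math.log10(3) - math.log10(2))    # width of [2, 3): 0.176, the Benford P(2)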

What I like about Benford's law is that it shows how prevalent exponential and recursive processes are in all aspects of nature. It shows clearly that linear systems occur far less often than we naively expect. It also shows that our very intuitive ways of perceiving the world (such as the idea that the mean is always the fairest average) can often be flawed, skewed, or fail to tell us what we think they do.
