Cute, but mathematically and statistically unsound. When you try to say
g(x,y) = xf*yf*T
you assume that the occurrences of X and Y are independent, that is, you assume a web page author puts Y in a page regardless of whether he or she has put X in it or not. This simply isn't true! Most web pages are built to convey ideas, and words appear together in pages if there is a memetic connection between them. The stronger the connection, the more likely they are to appear together. A page which contains tectonic is much more likely to contain subduction than meringue.
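To make the objection concrete, here is a minimal sketch on a made-up index of 7,000 numbered pages. It assumes xf means g(x)/T, so that the formula rearranges to T = g(x)*g(y)/g(x,y); that reading, the page layout, and all the function names are my own assumptions for illustration, not anything Google exposes.

```python
# Toy index of 7,000 numbered pages; every word placement here is invented.
TOTAL_PAGES = 7000
pages = range(TOTAL_PAGES)

def has_tectonic(p):
    return p % 10 == 0   # on every 10th page

def has_meringue(p):
    return p % 7 == 0    # on every 7th page, unrelated to tectonic

def has_subduction(p):
    return p % 20 == 0   # only ever on pages that also contain tectonic

def hits(*words):
    """Count pages containing every given word (a stand-in for g)."""
    return sum(1 for p in pages if all(word(p) for word in words))

def estimate_total(x, y):
    """Rearranged independence formula: T = g(x) * g(y) / g(x, y)."""
    return hits(x) * hits(y) / hits(x, y)

independent_estimate = estimate_total(has_tectonic, has_meringue)
correlated_estimate = estimate_total(has_tectonic, has_subduction)

print(independent_estimate)  # 7000.0, the true page count
print(correlated_estimate)   # 700.0, ten times too small
```

With the unrelated pair the formula recovers the true page count exactly, because the toy placement really is independent. With the related pair the joint count is inflated, and the estimate collapses by a factor of ten.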
So, let's try the technique on some unrelated words:
g(aardvark) = 238,000
g(blunderbuss) = 16,800
g(carioca) = 323,000
g(aardvark,blunderbuss,carioca) = 29
T = 2.39e14
That is a factor of 100,000 over Google's claims. It might be interesting to notice that all 29 pages with all three words are word lists, and so the words are as "unrelated" as they can get. Of course, I'd have been delighted to find "A carioca paused to discharge a blunderbuss at a passing aardvark, but missed and resumed samba-ing down the street." in Rio Expresso!
So, you might think the results would be more accurate for words that are more or less randomly distributed through web pages, or at least for words without semantic content. We'll try the technique with the most common words in the English language:
g(the) = 2.89e9
g(a) = 1.77e9
g(an) = 384,000,000
g(of) = 1.92e9
g(the, a, an, of) = 10,900,000
T = 1.7e34
Not even the Defense Department has servers with that sort of capacity. So, something's really wrong here.
Now, to our original purpose: estimating the number of pages indexed by Google. Notice the results for "the": 2.89e9. It's reasonable to expect "the" to appear in every Web page written in English, and in quite a few not written in English. Assuming that the number of pages not written in English is less than the number of pages written in English, the total is less than twice that figure, so Google probably indexes somewhere between 3 and 5 billion pages.