Lansey's Alternate Resume :: Googlewhack and Internet Search Result Probability

Internet Search Result Probabilities, Heaps' Law and Word Associativity

Publications

Jonathan C. Lansey, Bruce Bukiet. Journal of Quantitative Linguistics, January 2009, Volume 16, Number 1, pp. 40–66

The full version of the paper in PDF form is here.

The data for this topic are in Excel format here: LanseyGoogleData.xls

And in hyperlinked format here: LanseyGoogleData.html

The old poster for this paper is here.

Feel free to e-mail me any comments and/or questions.

My Google Tech Talk:
Googlewhacks for Fun and Profit

ABSTRACT:

We study the number of internet search results returned from multi-word queries
based on the number of results returned when each word is searched for individually.
We derive a model to describe search result values for multi-word queries using the
total number of pages indexed by Google and by applying the Zipf power law to the
words per page distribution on the internet and Heaps’ law for unique word counts.
Based on data from 351 word pairs each with exactly one hit when searched for
together, and a Zipf law coefficient determined in other studies, we approximate the
Heaps’ law coefficient for the indexed worldwide web (about 8 billion pages) to be
β = 0.52. Previous studies used under 20,000 pages. We demonstrate through examples
how the model can be used to analyse automatically the relatedness of word pairs
assigning each a value we call “strength of associativity”. We demonstrate the validity
of our method with word triplets and through two experiments conducted 8 months
apart. We then use our model to compare the index sizes of competing search giants
Yahoo and Google.
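
For readers unfamiliar with the quantities named in the abstract, here is a minimal sketch in Python. It is not the model derived in the paper. It only shows the standard form of Heaps' law, V(n) = K * n^beta, and a naive independence baseline in which two words land on the same page purely by chance. The function names, the constant K = 1.0, and the example hit counts are assumptions made for illustration; only beta = 0.52 and the index size of about 8 billion pages come from the abstract.

```python
def heaps_vocabulary(n_words, k, beta):
    """Heaps' law: estimated number of unique words in a text of n_words words.

    k and beta are empirical constants; the abstract reports beta = 0.52.
    """
    if n_words < 0:
        raise ValueError("n_words must be non-negative")
    return k * n_words ** beta


def independent_pair_hits(hits_a, hits_b, total_pages):
    """Naive baseline (an assumption, not the paper's model): expected number
    of pages containing both words if each word appeared on pages
    independently of the other."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return hits_a * hits_b / total_pages


if __name__ == "__main__":
    BETA = 0.52        # Heaps' law exponent reported in the abstract
    K = 1.0            # hypothetical constant, chosen only for illustration
    TOTAL_PAGES = 8e9  # approximate index size quoted in the abstract

    # Vocabulary growth slows as the text gets longer (beta < 1).
    for n in (1e3, 1e6, 1e9):
        print(f"n = {n:.0e} words -> about {heaps_vocabulary(n, K, BETA):.0f} unique words")

    # Hypothetical individual hit counts for two rare words.
    expected = independent_pair_hits(1000, 2000, TOTAL_PAGES)
    print(f"expected joint hits under independence: {expected}")
```

Under this baseline, two words with 1,000 and 2,000 individual hits would be expected to share far fewer than one page, so even a single joint hit would suggest the words are related. The paper's model replaces this crude independence assumption with one built on the Zipf and Heaps' laws.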