Internet Search Result Probabilities,
Heaps' Law and Word Associativity
 
 
 
Publications 

Jonathan C. Lansey, Bruce Bukiet. Journal of Quantitative Linguistics, January 2009, Volume 16, Number 1, pp. 40–66


The full version of the paper in pdf form is here.

The data for this topic are in Excel format here: LanseyGoogleData.xls

And in hyperlinked format here: LanseyGoogleData.html

The old poster for this paper is here.

Feel free to e-mail me any comments and/or questions.
 


My Google Tech Talk:
Googlewhacks for Fun and Profit

ABSTRACT:

We study the number of internet search results returned from multi-word queries
based on the number of results returned when each word is searched for individually.
We derive a model to describe search result values for multi-word queries using the
total number of pages indexed by Google and by applying the Zipf power law to the
words per page distribution on the internet and Heaps’ law for unique word counts.
Based on data from 351 word pairs each with exactly one hit when searched for
together, and a Zipf law coefficient determined in other studies, we approximate the
Heaps’ law coefficient for the indexed World Wide Web (about 8 billion pages) to be
β = 0.52. Previous studies used under 20,000 pages. We demonstrate through examples
how the model can be used to analyse automatically the relatedness of word pairs,
assigning each a value we call “strength of associativity”. We demonstrate the validity
of our method with word triplets and through two experiments conducted 8 months
apart. We then use our model to compare the index sizes of competing search giants
Yahoo and Google.
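
For readers unfamiliar with the ingredients named in the abstract, here is a minimal Python sketch of two of them. It is not the model derived in the paper. It shows only the textbook form of Heaps' law, V(n) = K * n^β, and the simplest independence baseline for how many pages would contain both words of a pair. The function names, the constant K, and the example hit counts are all assumptions made for illustration; only β = 0.52 and the rough index size of 8 billion pages come from the abstract.

```python
def heaps_vocabulary(n_words: float, k: float, beta: float) -> float:
    """Textbook Heaps' law: expected number of distinct words in n_words of text."""
    return k * n_words ** beta


def naive_joint_hits(hits_a: float, hits_b: float, total_pages: float) -> float:
    """Pages expected to contain both words if the words occurred independently.

    This is a plain independence baseline, NOT the paper's model.
    """
    return hits_a * hits_b / total_pages


def ratio_to_baseline(observed_joint: float, hits_a: float, hits_b: float,
                      total_pages: float) -> float:
    """Observed joint hits divided by the independence baseline.

    A hypothetical illustrative ratio; the paper defines its own
    "strength of associativity", which is not reproduced here.
    """
    return observed_joint / naive_joint_hits(hits_a, hits_b, total_pages)


if __name__ == "__main__":
    BETA = 0.52          # Heaps' law coefficient reported in the abstract
    TOTAL_PAGES = 8e9    # approximate index size quoted in the abstract

    # Under Heaps' law, doubling the text multiplies the vocabulary by 2**beta,
    # whatever K is (K = 1.0 here is an arbitrary placeholder).
    growth = heaps_vocabulary(2e6, 1.0, BETA) / heaps_vocabulary(1e6, 1.0, BETA)
    print(f"vocabulary growth when text doubles: {growth:.3f}")

    # Made-up hit counts for two words searched individually.
    baseline = naive_joint_hits(1e6, 2e6, TOTAL_PAGES)
    print(f"independence baseline for joint hits: {baseline}")
```

A ratio well above 1 would suggest that two words appear together more often than chance predicts, and a ratio well below 1 that they appear together less often. The paper's model refines this kind of comparison by accounting for the words-per-page distribution and Heaps' law.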


     

Copyright © Jonathan Lansey