Zipf’s law says that, in any large body of ordinary text in any language, the probability of a word occurring is roughly inversely proportional to its frequency rank.
For example, suppose the most common word (frequency rank 1) is ‘the’, the second most common (rank 2) is ‘and’, and the third (rank 3) is ‘of’. If 9000 occurrences of ‘the’ are found in the text, then we expect about 9000/2 = 4500 occurrences of ‘and’ and 9000/3 = 3000 occurrences of ‘of’.
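In code, the expected count is just the rank-1 count divided by the rank. A minimal sketch, using the made-up figures from above (the function name `expected_count` is mine, not part of the demo that follows):

```python
def expected_count(top_count, rank):
    """Expected occurrences of the word at a given frequency rank,
    assuming an ideal Zipf distribution."""
    return top_count / rank

# With 9000 occurrences of the rank-1 word:
print(expected_count(9000, 2))  # 4500.0
print(expected_count(9000, 3))  # 3000.0
```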
In practice any particular text is liable to deviate significantly from this, due to idiosyncrasies of style. Nevertheless, if we plot word rank against the reciprocal of word count, we can expect to see approximately a straight line, at least until we get down to low word counts, where the data gets lumpy and unpredictable.
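To see why the plot should be a straight line: if a word’s count is proportional to 1/rank, then the reciprocal of its count is proportional to rank. A quick sketch with ideal Zipf counts (made-up numbers, not measured from any real text):

```python
# Ideal Zipf counts for ranks 1..10, with 9000 at rank 1.
counts = [9000 / rank for rank in range(1, 11)]
reciprocals = [1 / c for c in counts]

# The reciprocals grow by a constant step (1/9000) per rank,
# i.e. they lie on a straight line.
steps = [b - a for a, b in zip(reciprocals, reciprocals[1:])]
print(steps)
```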
Why Zipf’s law works is a bit of a mystery. Some people claim to know the reason; others disagree.
To demonstrate Zipf’s law we’ll use Pandas and Matplotlib.
We’ll create the following functions:
- `get_frequencies`: counts the words and puts the result in a dictionary.
- `create_dataframe`: puts the data into a Pandas dataframe and sorts it by word frequency rank.
- `main`: uses the dataframe and Matplotlib to plot a graph of word rank vs. the reciprocal of word frequency, for the first n words, where n is some number we choose arbitrarily to get a nice graph.
The text I’ve used here is Charles Dickens’ *A Christmas Carol*.
```python
import re
from collections import defaultdict

import matplotlib.pyplot as plt
import pandas as pd


def get_frequencies(file):
    """Get word frequencies in a dictionary."""
    # We use defaultdict so that when we try to access a key
    # that doesn't exist, a value of zero is returned
    # instead of an error.
    freqs = defaultdict(int)
    with open(file, 'rt') as f:
        for line in f:
            # Split the line on space and punctuation.
            # This doesn't do a perfect job, but it's OK.
            words = re.split(r'[\s\!\,\.\?]+', line)
            # Add the words to the dictionary, lower-cased so
            # that the same word with different casing
            # isn't counted twice.
            for word in words:
                if word:
                    freqs[word.lower()] += 1
    return freqs


def create_dataframe(freqs):
    """Put the words in a Pandas dataframe."""
    df = pd.DataFrame({'word': freqs.keys(),
                       'occurrences': freqs.values()})
    # Sort by word count, largest count first.
    df.sort_values(by='occurrences', ascending=False, inplace=True)
    # Re-index (dropping the old index), so we can use the
    # index as the word 'rank': the most common word has
    # rank 0, the second rank 1, the third rank 2, and so on.
    df.reset_index(drop=True, inplace=True)
    return df


def main():
    freqs = get_frequencies('christmascarol.txt')
    df = create_dataframe(freqs)

    fig, ax = plt.subplots()
    fig.suptitle("Zipf's Law")
    ax.set_xlabel("Word frequency rank")
    ax.set_ylabel("Reciprocal of word count")

    # This controls how many rows to plot.
    rows = 200
    x = df.head(rows).index
    # Take the reciprocal of the word counts.
    y = 1.0 / df.head(rows)['occurrences']
    ax.plot(x, y)
    plt.show()
    print(df)


main()
```
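As a quick sanity check of the splitting regex used in get_frequencies, here is what it does to a made-up sample sentence. Note the empty string produced by the trailing full stop, which is why the loop skips falsy words:

```python
import re

# Split a sample line on runs of whitespace and punctuation,
# exactly as get_frequencies does.
words = re.split(r'[\s\!\,\.\?]+', "Bah! Humbug, said Scrooge.")
print(words)  # ['Bah', 'Humbug', 'said', 'Scrooge', '']
```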
The result is almost a straight line. Amazing!