
Applies to the frequency table of words in a corpus of a natural language:

$$
\text{word frequency} \propto \frac{1}{\text{word rank}}
$$

Empirically:

  • the most common word occurs approximately twice as often as the second most common, three times as often as the third most common, and so on.
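
The empirical claim above can be checked by building a rank-frequency table and comparing each count with the top count. A minimal sketch, assuming a pre-tokenised corpus; the helper names `rank_frequency` and `ratio_to_top` and the toy token list are hypothetical, chosen so the ratios come out exactly Zipfian:

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def ratio_to_top(table):
    """For each rank, how many times more often the top word occurs."""
    top_count = table[0][2]
    return [(rank, top_count / count) for rank, _, count in table]


# Toy corpus (an assumption, not real data): counts 6, 3, 2.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
table = rank_frequency(tokens)
ratios = ratio_to_top(table)
print(ratios)  # [(1, 1.0), (2, 2.0), (3, 3.0)]
```

Under an exact 1/rank law the ratio at rank k equals k; on a real corpus the ratios would only approximate this.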

generalised by the Zipf-Mandelbrot law:

$$
\begin{aligned}
\text{frequency} &\propto \frac{1}{(\text{rank} + b)^a} \\[8pt]
&\because a,b: \text{fitted parameters with } a \approx 1 \text{ and } b \approx 2.7
\end{aligned}
$$
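
To see how the shift `b` flattens the head of the curve, the unnormalised weights can be computed directly. A rough sketch: the function names are hypothetical, and the defaults `a=1.0`, `b=2.7` simply reuse the approximate fitted values quoted above rather than values fitted to any particular corpus.

```python
def zipf_mandelbrot_weights(n, a=1.0, b=2.7):
    """Unnormalised weights 1 / (rank + b)^a for ranks 1..n."""
    return [1.0 / (rank + b) ** a for rank in range(1, n + 1)]


def normalise(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# With b = 0 and a = 1 the weights reduce to plain 1/rank.
plain = zipf_mandelbrot_weights(3, a=1.0, b=0.0)

# With the quoted parameters, rank 1 is less dominant relative to rank 2.
shifted = zipf_mandelbrot_weights(3)
probs = normalise(zipf_mandelbrot_weights(10))
```

Here `plain[0] / plain[1]` is 2, while `shifted[0] / shifted[1]` is 4.7 / 3.7, roughly 1.27, so the shifted law predicts a much smaller gap between the two most frequent words.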

definition

Zipf distribution

the distribution on $N$ elements assigns to the element of rank $k$ (counting from 1) the probability:

$$
\begin{aligned}
f(k;N) &= \begin{cases} \frac{1}{H_N} \frac{1}{k}, & \text{if } 1 \leq k \leq N, \\ 0, & \text{if } k < 1 \text{ or } N < k. \end{cases} \\[12pt]
&\because H_N \equiv \sum_{k=1}^{N} \frac{1}{k} \quad (\text{normalisation constant})
\end{aligned}
$$
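
The probability mass function above translates almost line for line into code. A minimal sketch using exact rational arithmetic so that normalisation can be checked without rounding error; the function names `harmonic_number` and `zipf_pmf` are hypothetical:

```python
from fractions import Fraction


def harmonic_number(n):
    """H_N = sum_{k=1}^{N} 1/k, the normalisation constant."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def zipf_pmf(k, n):
    """f(k; N): probability of the element of rank k among N elements."""
    if k < 1 or k > n:
        return Fraction(0)
    return Fraction(1, k) / harmonic_number(n)


# Worked case N = 3: H_3 = 1 + 1/2 + 1/3 = 11/6.
pmf = [zipf_pmf(k, 3) for k in range(1, 4)]
print(pmf)  # [Fraction(6, 11), Fraction(3, 11), Fraction(2, 11)]
print(sum(pmf))  # 1
```

For large `N`, recomputing `harmonic_number` on every call is wasteful; caching it or switching to floats would be the obvious adjustment.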