Zipf's Law
Viswanath Poosala
900 839 0750
[email protected]
Abstract
Many naturally occurring phenomena exhibit regularity to some extent. G. K. Zipf
noticed that subjects as diverse as income distribution, word frequency, and genera-species
distributions exhibited a common regularity. A common empirical regularity
suggests some universal principle behind the phenomena. He proposed a law to model
this behavior, which is essentially an algebraically decaying function describing the
probability distribution. This report describes various formulations of the law and
concentrates on a few attempts by statisticians and linguists to model the causality
behind such phenomena, which would explain when and why Zipf's law should hold. It
also attempts to consolidate the various phenomena which have been empirically shown
to obey Zipf's law. Finally, verification of the law using some real-life information
reinforces its validity.
1 Introduction
Nature is full of phenomena which seem to obey a few laws. Some, such as falling apples, can
be explained satisfactorily based on certain laws of physics or other mathematical sciences.
On the other hand, there are some events, such as those occurring in chaotic systems, which
do not exhibit any apparent regularity. There is another type of phenomenon which exhibits
empirically observable regularities but does not directly yield to explanation using simple
laws of nature. Once the appropriate laws are established (which can be physical, mathematical,
statistical, etc.), these phenomena pass into the first class. This scientific process usually
proceeds by proposing a small set of laws which can model the phenomena and establishing
the necessary and sufficient conditions for the occurrence of the apparent regularity.
Whenever such apparently regular phenomena are observed over large samples of data,
it becomes an important task for statisticians to detect the statistical properties of the
system which cause such regularities. As an example, if we collect data on the relative
frequency of getting $r$ heads in $n$ tosses of an unbiased coin, after many such experiments a
pattern will emerge for the relative frequencies which depends on $r$ and $n$. This statistical
phenomenon can be explained once it is established that each toss of a coin is a Bernoulli
experiment, so that $n$ such tosses follow the binomial distribution $\mathrm{Bin}(n, 0.5)$. The probability
of tossing exactly $r$ heads in $n$ attempts is then simply $\binom{n}{r}(0.5)^n$. This is an example of a
regular phenomenon occurring in nature which can be satisfactorily explained using statistical
principles.
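As a concrete illustration (a minimal sketch, not part of the original report), the binomial probability above can be computed directly:

```python
from math import comb

def prob_heads(r, n, p=0.5):
    """P(exactly r heads in n tosses) = C(n, r) * p^r * (1-p)^(n-r);
    for a fair coin (p = 0.5) this reduces to C(n, r) * 0.5^n."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

print(prob_heads(5, 10))  # 0.24609...: the most likely head count for n = 10
```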
This report deals with another such empirical phenomenon, which has been observed in
fields as diverse as population distribution, word usage, and biological genera and species.
G. K. Zipf first proposed a law (named Zipf's law) which he observed to be approximately
obeyed in many of these domains [Zipf 49]. This ubiquitous empirical regularity suggests the
presence of a universal principle. This report mainly concentrates on various formulations of
the law and describes a few attempts at statistically explaining its theoretical underpinnings.
In particular, the work relating to the frequency of usage of words is presented in most detail.
A more practical goal of this project is to consolidate the various cases in which Zipf's law has
been empirically shown to hold. This could be valuable, for example, in designing databases
involving certain statistical assumptions about the distribution of the underlying data. As
an independent verification of the law, some real data distributions were also investigated.
The report is organized as follows. In section 2, we present the various formulations of
Zipf's law. In section 3, we present a detailed overview of various approaches to explaining the
theoretical foundation of the law. Next, we briefly state the assumptions about the
underlying system's behavior that one of the derivations requires for Zipf's law to hold good. After
that, we present a set of phenomena obeying Zipf's law and the results of investigating the
data distribution in a real-life database (NBA statistics), which essentially verified
Zipf's law. Finally, we summarize the conclusions drawn from this survey.
2 Formulation of Zipf's Law
Different versions of Zipf's law exist, varying in their generality. As explained below, the
simplest form of Zipf's law has been criticized on multiple grounds and was later generalized
to a more complex form.
2.1 Simple form of Zipf's law
Consider a set of data values, ranked by their value such that

$$x_1 \ge x_2 \ge \cdots \ge x_n,$$

$r$ being the rank of $x_r$ in this order; $x_r$ can be thought of as the size of the $r$-th data value
in the ordered set. Zipf's law is a relation between the rank of a data value and its actual
value, which has been empirically noted to be as follows (in its non-general form):

$$r \, x_r = \mathrm{constant} \qquad (1)$$
Zipf and others verified that this law holds in various kinds of domains, as listed in the
introduction and in section 5. This rank-size relation is known as Zipf's law, and its graph
is a rectangular hyperbola (figure 1).
[Figure: a rectangular hyperbola; Size (y-axis, 0-100) vs. Rank (x-axis, 0-100)]
Figure 1: Size vs Rank for the simple version of Zipf's law
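To make relation (1) concrete, the following sketch (illustrative only; the data are synthetic) ranks a data set and checks that the rank-size products are roughly constant:

```python
def rank_size_products(values):
    """Sort values in decreasing order and return r * x_r for each rank r;
    under the simple Zipf relation (1) these products are roughly constant."""
    ranked = sorted(values, reverse=True)
    return [r * x for r, x in enumerate(ranked, start=1)]

# Synthetic data obeying x_r = 100 / r exactly: every product equals 100.
print(rank_size_products([100 / r for r in range(1, 11)]))
```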
Let $x$ be the size of an object and $f(x)$ be its relative frequency of occurrence, where
$\int_0^\infty f(x)\,dx = 1$. If $n$ is the number of objects in the data set and $N(x)$ the number of objects
with size greater than $x$, then

$$N(x) = \int_x^\infty n f(u)\,du \qquad (2)$$

is the rank of the object of size $x$. Under Zipf's law (1), $x N(x) = K$ (some constant). Hence,

$$f(x) = -N'(x)/n = K'/x^2 \qquad (3)$$

where $K' = K/n$. Equation (3) is the size-frequency relation corresponding to (1). G. K.
Zipf attempted to explain the origins of the law in the nature of human behavior, through
the principle of least effort.
The above formulation has the following deficiencies:

1. Zipf's explanation [Zipf 49] in terms of human behavior does not explain the
underlying statistical process.

2. The value of the constant $K'$ in (3) depends on the number of objects $n$.

3. As discussed below, a statistical explanation for the phenomena observed by Zipf
leads to a family of distributions, and (3) is just a special case of them.
2.2 Generalized Zipf's law

As explained in the above section, a main drawback of the simple form of Zipf's law is that the
phenomena observed by Zipf and justified by statistical rationale lead to a family of distributions, namely,

$$r^a x_r = \mathrm{constant}, \qquad a > 0 \qquad (4)$$
After some analysis, this leads to the following rank-frequency relation:

$$f(r) = A \, r^{-(1+a)}, \qquad r = 1, 2, \ldots, \qquad (5)$$

where $a > 0$ and

$$A = \zeta(1+a)^{-1} = \left[\sum_{r=1}^{\infty} r^{-(1+a)}\right]^{-1}, \qquad (6)$$

$\zeta$ being the zeta function. The above equation defines the discrete Pareto distribution [John 69],
which includes (3) as a special case when $a = 1$.
Figure 2 plots the above rank-frequency function for various values of $a$ ranging from 0
to 4 in intervals of 0.5. Note that when $a = 0$ the distribution is uniform, and as $a$ increases,
the skew of the function increases.
[Figure: a family of decreasing curves for a = 0 to 5; Frequency (y-axis, 0-1000) vs. Rank (x-axis, 0-20)]
Figure 2: Frequency vs Rank for the general version of Zipf's law
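A small sketch of the family (5)-(6), using SciPy's Riemann zeta function to normalize; the parameter values are illustrative:

```python
import numpy as np
from scipy.special import zeta  # Riemann zeta(s), finite for s > 1

def discrete_pareto_pmf(r, a):
    """f(r) = r^{-(1+a)} / zeta(1+a), equations (5)-(6); requires a > 0."""
    return r ** -(1.0 + a) / zeta(1.0 + a)

ranks = np.arange(1, 21)
for a in (0.5, 1.0, 2.0):               # the skew increases with a
    print(a, np.round(discrete_pareto_pmf(ranks, a)[:5], 4))
```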
3 Theoretical foundation of Zipf's law
There are at least three major schools of thought on the theoretical underpinnings of Zipf's
law. Hill and Woodroofe [Hill 75, Wood 75] and others show that Zipf's law can be derived
from various stochastic processes, including the Bose-Einstein model and Fisher's logarithmic
series distribution. Another approach, proposed by [Price 76], has been to manipulate classical
occupancy models to yield hyperbolic distributions. The third approach, due to Mandelbrot
[Mandelbrot 53], takes an information-theoretic view of the statistical structure of language,
leading to Zipf's law. The following subsection deals with Price's
derivation in detail. After that, a brief description of Mandelbrot's approach is given.
3.1 Cumulative Advantage Distribution (Price's approach)
Price [Price 76] presents the cumulative advantage distribution, which can be derived as a
stochastic birth process. Consider a population of $n_T$ individuals. Let $r_i$ be the total number
of "successes" achieved by the $i$-th individual. Let $f(r)$ be the fraction of individuals with $r$
successes, so that

$$\sum_{r=1}^{\infty} f(r) = 1.$$

The mean number of previous successes is

$$R = \sum_{r=1}^{\infty} r f(r).$$

An individual with $r$ successes is considered to be in state $r$. Transitions occur only by the
incidence of a further success on an individual, which moves the individual from state
$r$ to $r + 1$. Note that this is a strictly birth-only process because no transitions occur in the
reverse direction.
Now, suppose that a small number $dn_T$ of new individuals is added to the population,
and with them $R \, dn_T$ new successes are sprinkled evenly at random over the whole
population. Note that the total number of successes is proportional to the population size.
Hence, the number of new successes per single previous success is

$$dn_T / n_T. \qquad (7)$$

By definition, there are $f(r) \, n_T$ individuals in state $r$ (before sprinkling). Hence, the
number of previous successes in state $r$ is $r f(r) \, n_T$. By equation (7), the number of new
successes sprinkled into state $r$ is

$$r f(r) \, dn_T. \qquad (8)$$

This is also the number of individuals transferring from state $r$ to state $r + 1$. Similarly, the
number of individuals transferring from state $r - 1$ to state $r$ is

$$(r-1) f(r-1) \, dn_T. \qquad (9)$$
From equations (8) and (9), it follows that:

$$\frac{d}{dn_T}\left[n_T f(r)\right] = \begin{cases} -r f(r) + (r-1) f(r-1), & r > 1 \\ -f(1) + 1, & r = 1. \end{cases}$$

Expanding the left-hand side by the product rule, it follows that:

$$n_T \frac{df(r)}{dn_T} = \begin{cases} -(r+1) f(r) + (r-1) f(r-1), & r > 1 \\ -2 f(1) + 1, & r = 1. \end{cases}$$

The distribution over the states is defined by this series of difference equations. It can be
seen that for a stable distribution, for which $f(r)$ is independent of $n_T$, the left-hand side of
the above equation becomes zero; this gives $f(1) = 1/2$ and, solving recursively,

$$f(r) = \frac{r-1}{r+1} f(r-1) = \frac{r-1}{r+1} \cdot \frac{r-2}{r} \cdots \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{r(r+1)}.$$

This is a special and important form of the Zipf relationship [Ijiri 77].
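The closed form is easy to check numerically; the sketch below (not part of Price's paper) verifies both the stationarity condition and the normalization of $f(r) = 1/(r(r+1))$:

```python
def f(r):
    """Price's stable distribution, f(r) = 1 / (r (r + 1))."""
    return 1.0 / (r * (r + 1))

# Stationarity: -(r + 1) f(r) + (r - 1) f(r - 1) should vanish for r > 1.
for r in range(2, 6):
    print(r, -(r + 1) * f(r) + (r - 1) * f(r - 1))  # prints 0.0 each time

# Normalization: sum_{r >= 1} 1/(r(r+1)) telescopes to 1.
print(sum(f(r) for r in range(1, 1_000_000)))       # ~0.999999
```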
3.2 Mandelbrot's derivation
Mandelbrot's work in information theory and linguistics led to another derivation of Zipf's
law. He assumed that the aim of language is to transmit the most information per symbol, in
the information-theoretic sense of Shannon, with the least effort. He obtained the following
relationship in [Mandelbrot 53]:

$$f(r) = K (r + c)^{-B}, \qquad (10)$$

where $f(r)$ is the word frequency and $r$ is the rank of the word. The constant $c$ improves
the fit for small $r$ and the exponent $B$ improves the fit for large $r$. Through a series of
substitutions into a more complex argument of Mandelbrot [Mandelbrot 57], Booth demonstrated
([Booth 67]) that Zipf's law and Mandelbrot's revision are equivalent.
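As an illustration of how a form like (10) is used in practice, the sketch below fits $K$, $c$, and $B$ to rank-frequency data with SciPy; the data are synthetic stand-ins for real word counts, not from the report:

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf_mandelbrot(r, K, c, B):
    """Equation (10): f(r) = K (r + c)^(-B)."""
    return K * (r + c) ** -B

# Synthetic word-frequency data with K=1000, c=2, B=1.1 plus mild noise.
ranks = np.arange(1.0, 51.0)
freqs = 1000 * (ranks + 2.0) ** -1.1
freqs += np.random.default_rng(0).normal(0.0, 0.5, freqs.size)

params, _ = curve_fit(zipf_mandelbrot, ranks, freqs, p0=(500.0, 1.0, 1.0),
                      bounds=([0, 0, 0], [np.inf, np.inf, np.inf]))
print(params)  # should recover roughly (1000, 2, 1.1)
```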
3.3 Simon's approach
Simon [Simon 55] expanded on Zipf's work by describing a set of empirically derived skew
distribution functions. His model is also presented in terms of word frequencies. He shows
that the distribution of words in a text behaves according to the following equations:

$$\sum_{r=1}^{\infty} f(r) = 1, \qquad f(r) = A \, B(r, \rho + 1), \qquad (11)$$

where $A$ and $\rho$ are constants and $B(r, \rho+1)$ is the beta function of $r$ and $\rho + 1$, given by:

$$B(r, \rho+1) = \int_0^1 \theta^{r-1} (1-\theta)^{\rho} \, d\theta = \frac{\Gamma(r)\,\Gamma(\rho+1)}{\Gamma(r+\rho+1)}.$$

As $r$ goes to infinity, for any constant $\rho + 1$,

$$\frac{\Gamma(r)}{\Gamma(r+\rho+1)} \to r^{-(\rho+1)}. \qquad (12)$$

So, from the above equation,

$$f(r) = A \, \Gamma(\rho+1) \, r^{-(\rho+1)}. \qquad (13)$$

This is equivalent to Zipf's law, $f(r) = c/r^{(1+a)}$, with $1+a$ equal to $\rho+1$ and $c$ equal to $A \, \Gamma(\rho+1)$.
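The asymptotic step (12) is easy to confirm numerically; a small check (illustrative, using log-gamma to avoid overflow) with $\rho = 1$:

```python
from math import exp, lgamma

rho = 1.0
for r in (10, 100, 1000, 10000):
    exact = exp(lgamma(r) - lgamma(r + rho + 1))  # Gamma(r) / Gamma(r+rho+1)
    approx = r ** -(rho + 1.0)
    print(r, exact / approx)   # the ratio tends to 1 as r grows
```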
3.4 Rationale behind Zipf's law
In the previous section we presented some of the statistical derivations of Zipf's law, with
tacit assumptions about the properties of the underlying system. In this section, we briefly
summarize some of the assumptions made about the system being studied. For the sake of
brevity, the system being studied is limited to the usage frequency of words in literature.
These assumptions are used in Simon's derivation presented above.
It has been observed [Simon 55] that the stochastic process by which words are chosen
for inclusion in written text follows two steps:

- By processes of association, i.e., sampling earlier segments of the author's own word sequences.
- By imitation, i.e., sampling from other works by the same or other authors.
The assumptions made in deriving Simon's formula (11) are:

1. The probability that the $(T+1)$-st word has appeared exactly $r$ times is proportional
to $r f(r, T)$, that is, to the total number of occurrences of all words that
have appeared exactly $r$ times.

2. For large $T$, there is a constant probability $\omega$ that the $(T+1)$-st word is a new
word (one that has not appeared in the first $T$ words).

Words chosen by association can only be the result of assumption 1, whereas words chosen
by imitation can also be new. Note that these assumptions are quite reasonable in practice.
By assigning probabilities for imitation and association, these assumptions can be shown
to lead to (13). The derivation is not complicated but is too long to be included here. It is
presented in [Fedo 81].
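A minimal simulation of a text-generation process consistent with these assumptions (a sketch with the new-word probability $\omega$ as a free parameter; this is not Simon's own derivation): each new token is a fresh word with probability $\omega$, and otherwise copies a uniformly chosen earlier token, which is exactly sampling an existing word proportionally to its current frequency (assumption 1).

```python
import random
from collections import Counter

def simon_text(T, omega=0.1, seed=0):
    """Generate T word tokens: with probability omega the next token is a
    brand-new word; otherwise it copies a uniformly chosen earlier token
    (i.e., an existing word chosen proportionally to its frequency)."""
    rng = random.Random(seed)
    tokens, next_word = [0], 1
    for _ in range(T - 1):
        if rng.random() < omega:
            tokens.append(next_word)   # new word enters the vocabulary
            next_word += 1
        else:
            tokens.append(rng.choice(tokens))  # imitate an earlier word
    return Counter(tokens)

freqs = sorted(simon_text(100_000).values(), reverse=True)
print(freqs[:10])  # heavily skewed; log-log rank-frequency is near-linear
```

For small $\omega$ the resulting rank-frequency curve approaches the classic Zipf exponent of 1, consistent with (13).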
4 Domains in which Zipf's law holds
Often, in the design of various systems, some assumptions need to be made about the
underlying domain. For most systems dealing with non-deterministic data, even a small amount
of correct knowledge about the underlying data distribution can be highly beneficial. Various
researchers have analyzed different data domains and identified a few that empirically
obey Zipf's law. The aim of this section is to consolidate some of this work and
present the domains in which the law has been verified to hold. The table below makes it
apparent that Zipf's law holds over vastly diverse domains and phenomena which bear no
obvious relation to one another.
Domain          Examples
Bibliography    Frequency of occurrence of words in an article
                Number of publications of authors
                Books by number of pages
Citation        Frequency of citations of an article
Geography       Length of a rugged coastline
                Cities by population
Biology         Genera by number of species
Computer Sci.   Distribution of data in a database
People          Income distribution of employees in a firm
                First letters of people's names
                Last names of people
5 Verification of Zipf's law on real distributions

So far in this report we have concentrated on the theoretical verification of Zipf's law and
on some of the documented observations. In order to independently verify whether the law holds
for some real data distributions, a database of statistics of NBA basketball
players for the 1991-92 season was obtained [NBA 92]. These statistics include the number of
goals scored by a player in a season, the number of blocked shots, etc. To verify Zipf's law on the
number of blocked shots, we obtained the frequencies of the various blocked-shot counts; i.e., for
a given number of blocked shots $n$, the number of players with $n$ blocked shots ($f_n$) is obtained
and plotted. If Zipf's law holds, we expect the distribution to be roughly hyperbolic, similar
to one of the curves in figure 2. The frequency distribution for the number of blocked shots
is plotted in figure 3.
[Figure 3: Frequency vs Rank for the NBA blocked-shot statistics — a roughly hyperbolic decreasing curve; Frequency (y-axis, 0-70) vs. Rank (x-axis, 0-100)]

As can be observed from figure 3, the distribution is roughly hyperbolic. It was
also verified that many of the other statistics yielded similar graphs, thus empirically verifying
Zipf's law for a real-life database.
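The frequency-of-frequencies computation behind figure 3 is straightforward; the sketch below uses made-up blocked-shot totals, since the original nba9192.txt values are not reproduced here:

```python
from collections import Counter

# Hypothetical per-player blocked-shot totals standing in for nba9192.txt.
blocked = [0, 0, 1, 0, 2, 5, 1, 0, 12, 3, 1, 0, 0, 7, 2, 1, 24, 0, 1, 4]

f = Counter(blocked)              # f[n] = number of players with n blocks
for n, count in sorted(f.items()):
    print(n, count)
# Sorting the counts in decreasing order gives a curve like figure 3.
```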
6 Conclusions
In this report, we presented various formulations of Zipf's law and concentrated on some
of its theoretical derivations. The multiple derivations leading to the same law strongly
hint at the universality of the principle. We also presented the assumptions that must hold
in the underlying systems for one of the derivations to be valid; the naturalness of these
assumptions reinforces the validity of the derivation. We have also presented a set of diverse
domains and phenomena where Zipf's law has been empirically verified to hold. The vast
disparity between these domains speaks for the generality of Zipf's law. Finally, we observed
that Zipf's law holds for a real-life database as well.
----------------------------------------------------------------------
Postscript (or how the report literally verified Zipf's law)

This report dealt with Zipf's law mostly in the domain of word usage, and it would probably be
incomplete without verifying the law on the report itself. So, as an afterthought, I measured
the frequency of the words used in this report (ignoring case and excluding the postscript);
the resulting plot is shown in figure 4. Clearly, the distribution is very close to a highly
skewed Zipfian distribution (compare with the curves in figure 2). Not surprisingly, "zipf"
is among the most frequent words, used 61 times.
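The measurement itself amounts to a case-insensitive word count; a sketch of the kind of script involved (the filename is hypothetical):

```python
import re
from collections import Counter

def rank_frequency(path):
    """Case-insensitive word counts for a text file, sorted by decreasing
    frequency: the data behind a rank-frequency plot like figure 4."""
    text = open(path, encoding="utf-8").read().lower()
    return Counter(re.findall(r"[a-z']+", text)).most_common()

# e.g. rank_frequency("report.txt")[:3] -> [('the', ...), ('of', ...), ...]
```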
[Figure: a highly skewed decreasing curve; Frequency (y-axis, 0-200) vs. Rank (x-axis, 0-600)]
Figure 4: Frequency vs Rank for word usage in this report
References
[Booth 67]      A. D. Booth, "A law of occurrences for words of low frequency", Information and Control, 10(4), April 1967.

[Chen 80]       Wen-Chen Chen, "On the weak form of Zipf's law", J. Applied Probability, 17, 1980.

[Fedo 81]       Jane Fedorowicz, "The theoretical foundations of Zipf's law and its application to the bibliographic database environment", Journal of the American Society for Information Science, 1981.

[Hill 74]       B. Hill, "Rank frequency forms of Zipf's law", Journal of the American Statistical Association, 1974.

[Hill 75]       Bruce Hill, Michael Woodroofe, "Stronger forms of Zipf's law", Journal of the American Statistical Association, 1975.

[Ijiri 77]      Y. Ijiri, H. A. Simon, "Skew distribution functions and the size of business firms", North-Holland, New York, 1977.

[John 69]       N. L. Johnson, S. Kotz, "Distributions in Statistics: Discrete Distributions", Vol. 1, Wiley, New York, 1969.

[Mandelbrot 53] B. Mandelbrot, "An information theory of the statistical structure of language", Proc. Symposium on Applications of Communication Theory, Sept. 1952.

[Mandelbrot 57] B. Mandelbrot, "Théorie mathématique de la loi d'Estoup-Zipf", Institut de Statistique de l'Université de Paris, 1957.

[Price 76]      D. De Solla Price, "A general theory of bibliometric and other cumulative advantage processes", Journal of the American Statistical Association, 1976.

[NBA 92]        "nba9192.txt", Internet (ftp: olympos.cs.umd.edu, Univ. of Maryland).

[Simon 55]      H. A. Simon, "On a class of skew distribution functions", Biometrika, 42, 1955.

[Wood 75]       Michael Woodroofe, Bruce Hill, "On Zipf's law", J. Applied Probability, 12, 1975.

[Zipf 49]       G. K. Zipf, "Human Behavior and the Principle of Least Effort", Addison-Wesley, Reading, MA, 1949.