
The role of entropy in word ranking

Ali Mehri ∗, Amir H. Darooneh
Department of Physics, Zanjan University, P.O. Box 45196-313, Zanjan, Iran

Article info

Article history: Received 26 October 2010; Received in revised form 14 February 2011; Available online 23 April 2011

Keywords: Entropy; Information theory; Complex systems; Statistical mechanics; Keyword extraction

Abstract

Entropy, as a measure of complexity in systems, has been applied to rank the words in human written texts. We introduce a novel approach to evaluate the accuracy of retrieved indices. We also present an illustrative comparison between the proposed entropic metrics and some other keyword extraction methods. It seems that some of the discussed metrics exploit similar features for word ranking in the text. This work recommends entropy as a systematic measure in text mining.


1. Introduction

The concept of entropy was introduced by Boltzmann for investigating statistical systems: $S_B = \ln\Omega$, where $\Omega$ denotes the total number of microstates corresponding to a macrostate of a many body system. Thereafter, Gibbs refined Boltzmann's idea and proposed a more general formula for the entropy [1]: $S_{BG} = -\sum_{i=1}^{\Omega} p_i \ln p_i$, where $p_i$ is the occurrence probability of the ith state. Later, Shannon arrived at an information (uncertainty) measure of the same celebrated form as the BG entropy in his work on information theory [2]. Intuitively, entropy measures the amount of information or uncertainty in a random variable associated with a natural process. A zero value of the entropy corresponds to a certain outcome of the random variable; on the other hand, if all outcomes are equally likely ($p_i = 1/\Omega$), the entropy attains its maximal value. Note that $S_{BG}$ is extensive: if A and B are two independent systems, then $S_{BG}(A+B) = S_{BG}(A) + S_{BG}(B)$. BG entropy is the fundamental principle of standard statistical mechanics, which has had countless successes in explaining the behavior of ubiquitous systems.

It is worth discussing relative entropy. The relative entropy, or Kullback–Leibler (KL) divergence, between two probability distributions $P(x)$ and $Q(x)$ of a random variable $X$ is defined as $S_{KL} = \sum_{x \in X} P(x) \ln\left(P(x)/Q(x)\right)$ [3]. This quantity is also called discrimination information, directed divergence and cross entropy in the literature. The KL divergence compares the entropies of two distributions over the same random variable. In principle, it gives the additional information required when a random variable is encoded with an alternative distribution; in other words, it can be interpreted as a distance between two distributions, or as a measure of similarity between them. Relative entropy is a non-symmetric measure of the difference between two probability distributions: it measures the expected number of extra bits required to code samples from distribution P when using a code based on Q.
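To make the asymmetry concrete, here is a minimal Python sketch (our illustration, not taken from the paper) of the KL divergence between two discrete distributions:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence S_KL = sum_x P(x) ln(P(x)/Q(x)).

    p and q are dicts mapping outcomes to probabilities; q must be
    nonzero wherever p is nonzero.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Two distributions over the same three outcomes.
P = {"a": 0.5, "b": 0.4, "c": 0.1}
Q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

print(kl_divergence(P, Q))  # extra nats needed to code P with a code built for Q
print(kl_divergence(Q, P))  # differs from the above: KL is non-symmetric
```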

Several universal features establish the complexity of human written texts [4–6], and statistical mechanics can be applied to study universal aspects of such complex systems. The retrieval of useful information from textual data is an important field in both academia and industry. Statistical properties of words in the text structure can be exploited for keyword extraction in the text mining process. The key problem in keyword extraction is ranking the words by their relevance to the text context. Different approaches for ranking words have been presented in the literature [7]. Luhn applied Zipf's law, and several other authors used the standard deviation of the distance between consecutive occurrences of a term [8–11]. Mihalcea and Tarau introduced a graph based method to classify the words in a text [12].

∗ Corresponding author. Tel.: +98 2413236646. E-mail addresses: [email protected], [email protected] (A. Mehri).



Fig. 1. Schematic diagram of a text. Each cell represents the position of a word in the text. The circles are the occurrence positions of a particular word type. All occurrence positions of a given word type with frequency M are collected in the set $\{t_1, t_2, \ldots, t_M\}$. The distance between the ith occurrence and its next neighbor is $d_i = t_{i+1} - t_i$. The arrow shows the direction of movement from the head of the text to its tail.

In this work, all entropic methods for ranking the words in a text are studied. We also compare the results of these methods with two other representative ranking methods, TextRank and standard deviation, which are explained in the Appendix.

The paper is organized as follows. Section 2 is devoted to some new definitions of entropy for the distribution of words in a text; a brief review of another form of entropy for words is also presented there. In Section 3, we obtain the accuracy of the entropic metrics for ranking the words in representative long and short texts. We also compare the results of the entropic methods with two other well-known criteria for keyword extraction, and then illustrate the relations between some entropic and nonentropic metrics. Finally, we report the running times of the discussed approaches.

2. Some entropy definitions for texts

Inspired by a one dimensional bosonic system, we consider a text of length N as a system partitioned into N sections. Words are distributed within these N partitions in a particular arrangement so as to convey a specific concept. The occurrence positions of a particular word type can be represented by $t_1, t_2, \ldots, t_M$, where M is the frequency of the considered word type in the text. The distance between nearest neighboring occurrences of a word type is $d_i = t_{i+1} - t_i$. The head and tail of the text are connected, as in Fig. 1, so the last distance wraps around: $d_M = N - t_M + t_1$. We introduce several entropic metrics for ranking the words in a text.

2.1. Entropy type A

If D is the sum of the distances between occurrences of a term, $D = d_1 + d_2 + \cdots + d_M$, then there are $\Omega = D!/(d_1!\,d_2!\cdots d_M!)$ ways to distribute that word in the text. One can define an entropy for this term as

$S_A = \ln\Omega \simeq -\sum_{i=1}^{M} \frac{d_i}{D} \ln\frac{d_i}{D}.$   (1)

Intuitively, we interpret $d_i/D$ as the probability of the existence of a cycle of length $d_i$ around the specific word type in the word co-occurrence network. A random text can be generated by randomly repartitioning the original text; in a random text the occurrences are evenly spread, $d_i \simeq N/M$, and it is easily shown that the entropy of a randomly distributed word takes the form $S_A^{ran} = \ln M$. A non-negative ranking metric can therefore be written as

$E_A = M\left(1 - \frac{S_A}{S_A^{ran}}\right).$   (2)

This metric combines frequency and entropy; the coefficient M promotes relevant words with higher frequencies.
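As an illustration only (not code from the paper), the following Python sketch computes E_A from the occurrence positions of a single word type, assuming 1-based positions and the circular closure of Fig. 1; the helper name and example numbers are ours:

```python
import math

def entropy_A(positions, text_length):
    """Entropy type A and the ranking metric E_A, Eqs. (1)-(2).

    positions: sorted occurrence positions t_1 < ... < t_M of one word type.
    text_length: N, the total number of words in the text.
    """
    M = len(positions)
    if M < 2:
        return 0.0
    # Distances between consecutive occurrences, with the text closed
    # into a ring so that d_M wraps from the last occurrence to the first.
    d = [positions[i + 1] - positions[i] for i in range(M - 1)]
    d.append(text_length - positions[-1] + positions[0])
    D = sum(d)
    S_A = -sum((di / D) * math.log(di / D) for di in d if di > 0)
    S_ran = math.log(M)              # entropy of an evenly spaced (random) word
    return M * (1.0 - S_A / S_ran)   # Eq. (2)

# Example: a word occurring 4 times, clustered near the start of a 100-word text.
print(entropy_A([3, 5, 8, 60], 100))
```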

2.2. Relative entropy type A

On the other hand, to compare the distribution of a word in a human written text with its distribution in a randomly shuffled text, one can also calculate the Kullback–Leibler divergence. As explained in the Introduction, here $P_i = d_i/D$ and $Q_i = 1/M$, so the KL relative entropy can be written as


$RE_A = \sum_{i=1}^{M} \frac{d_i}{D}\left[\ln\frac{d_i}{D} - \ln\frac{1}{M}\right] = \sum_{i=1}^{M} \frac{d_i}{D} \ln\frac{d_i M}{D}.$   (3)
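Under the same assumptions as the E_A sketch above (circular closure, occurrence positions given), Eq. (3) might be coded as follows; again this is our illustration, not the authors' code:

```python
import math

def relative_entropy_A(positions, text_length):
    """KL relative entropy RE_A of Eq. (3): P_i = d_i/D against Q_i = 1/M."""
    M = len(positions)
    if M < 2:
        return 0.0
    d = [positions[i + 1] - positions[i] for i in range(M - 1)]
    d.append(text_length - positions[-1] + positions[0])  # circular closure
    D = sum(d)
    return sum((di / D) * math.log(di * M / D) for di in d if di > 0)

print(relative_entropy_A([3, 5, 8, 60], 100))  # larger for clustered words
```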

2.3. Entropy type B

We can also define a probability for the distances between consecutive occurrences of a word type in the text. The relative frequency of a particular distance for a word type is taken as its probability, $p(d) = M_d/M$, where $M_d$ is the number of distances of length d for the given word type. Another form of entropy may then be written as

$E_B = -\sum_{d=1}^{d_{max}} p(d) \ln p(d),$   (4)

where $d_{max}$ is the maximum distance between consecutive occurrences of the considered word type. Note that $d_i$ and $p(d)$ are not directly related: the value of a particular distance between occurrences of a word type has no bearing on its probability $p(d)$ in the text. Short distances tend to have high probabilities in this summation only for relevant word types, because such words tend to form clusters in the text.
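A minimal sketch of Eq. (4), under the same assumed inputs (occurrence positions and text length, with circular closure); the helper name is hypothetical:

```python
import math
from collections import Counter

def entropy_B(positions, text_length):
    """Entropy type B, Eq. (4): entropy of the distance distribution p(d) = M_d / M."""
    M = len(positions)
    d = [positions[i + 1] - positions[i] for i in range(M - 1)]
    d.append(text_length - positions[-1] + positions[0])  # circular closure, as in Fig. 1
    counts = Counter(d)                                   # M_d for each observed distance d
    return -sum((md / M) * math.log(md / M) for md in counts.values())

print(entropy_B([3, 5, 8, 60, 62, 65], 100))  # repeated short distances lower the entropy
```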

2.4. Entropy type C

Another form of entropy for texts has been formulated by Herrera and Pury [13]. In this approach a text with N words is divided into C chapters, with $N = \sum_{i=1}^{C} N_i$, where $N_i$ is the number of words in the ith chapter. If $M_i(\omega)$ and $M(\omega)$ are the frequencies of a word type $\omega$ in the ith chapter and in the whole text, then $M(\omega) = \sum_{i=1}^{C} M_i(\omega)$. The relative frequency of a word type in the ith chapter is $f_i(\omega) = M_i(\omega)/M(\omega)$, and the probability associated with the relative frequency is defined as $p_i(\omega) = f_i(\omega)/\sum_{j=1}^{C} f_j(\omega)$. The entropy is then

$S_C(\omega) = -\frac{1}{\ln C} \sum_{i=1}^{C} p_i(\omega) \ln p_i(\omega).$   (5)

The normalized word ranking metric takes the form

$E_C(\omega) = \frac{1 - S_C(\omega)}{1 - S_C^{ran}(\omega)},$   (6)

where $S_C^{ran}(\omega) = 1 - (C-1)/(2M(\omega)\ln C)$ is the value of $S_C$ for a word in a random text.
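Assuming the per-chapter counts M_i(ω) are already available, Eqs. (5) and (6) might be sketched as follows (our illustration; the function name and example counts are hypothetical):

```python
import math

def metric_C(chapter_counts):
    """Entropy type C and the normalized metric E_C, Eqs. (5)-(6).

    chapter_counts: list [M_1(w), ..., M_C(w)] of the word's frequency per chapter.
    """
    C = len(chapter_counts)
    M = sum(chapter_counts)
    f = [mi / M for mi in chapter_counts]             # relative frequencies f_i
    total = sum(f)
    p = [fi / total for fi in f]                      # probabilities p_i
    S_C = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(C)
    S_ran = 1.0 - (C - 1) / (2.0 * M * math.log(C))   # S_C for a random text
    return (1.0 - S_C) / (1.0 - S_ran)                # Eq. (6)

# A word concentrated in a few chapters scores higher than an evenly spread one.
print(metric_C([12, 9, 0, 0, 1, 0, 0, 0, 0]))
print(metric_C([3, 2, 3, 2, 3, 2, 3, 2, 2]))
```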

3. Ranking the words by entropy

We use the above mentioned approaches to extract indices for two representative texts, without any lemmatization: Herrera and Pury's paper [13] as a short text, and the book The Origin of Species by Charles Darwin as a long text. We consider 9 sections for the paper, which contains 5303 words and 1078 word types. As a benchmark for evaluating the presented methods we use a manually prepared list of 29 keywords; for this purpose, we eliminate the common words from the title, keyword list and abstract of the paper and take the remaining words as keywords. Darwin's book has 16 chapters and 192643 words, and its vocabulary consists of 8180 terms. Unfortunately the original version of the book has no index or glossary, so we use a previously prepared glossary with 283 keywords as a benchmark to evaluate the accuracy of the retrieved indices [13]. First of all, we extract ordered lists of the vocabulary words of the texts by means of the defined measures. Then we evaluate the extracted indices using recall, precision and the F-measure. Each of these three criteria evaluates a different feature; for example, in summarization, recall is used as the accuracy criterion. Typical formulas for recall, precision and F-measure are [14]

$R = \frac{N_{rel \cap ret}}{N_{rel}},$   (7)

$P = \frac{N_{rel \cap ret}}{N_{ret}},$   (8)

$F = \frac{2RP}{R + P}.$   (9)

$N_{rel}$ and $N_{ret}$ are the numbers of words in the relevant and retrieved lists, respectively, and $N_{rel \cap ret}$ is the number of words in the intersection of the relevant and retrieved lists.
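Treating the retrieved and relevant indices as sets, a minimal sketch of Eqs. (7)–(9) is (our illustration, with hypothetical example lists):

```python
def recall_precision_f(retrieved, relevant):
    """Standard recall, precision and F-measure, Eqs. (7)-(9)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # N_{rel ∩ ret}
    R = hits / len(relevant) if relevant else 0.0
    P = hits / len(retrieved) if retrieved else 0.0
    F = 2 * R * P / (R + P) if (R + P) > 0 else 0.0
    return R, P, F

# Hypothetical retrieved index against a hypothetical keyword list.
print(recall_precision_f(["entropy", "ranking", "text", "the"],
                         ["entropy", "ranking", "keyword", "index", "text"]))
```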


Table 1
Recall, precision and F-measure of the indices extracted by E_A, RE_A, E_B, E_C, TextRank and standard deviation for the long and short documents. Darwin's book is denoted by D and Herrera and Pury's paper by HP. E_B has the best precision and E_C the highest recall and F-measure in the long document. In the short document, both RE_A and E_C have the highest recall and F-measure, while the highest precision belongs to TextRank.

Metric  Text  R_HP   P_HP   F
E_A     D     0.300  0.080  0.126
        HP    0.138  0.028  0.046
RE_A    D     0.276  0.066  0.107
        HP    0.172  0.028  0.049
E_B     D     0.156  0.141  0.148
        HP    0.104  0.028  0.044
E_C     D     0.417  0.102  0.164
        HP    0.172  0.028  0.049
TR      D     0.152  0.126  0.138
        HP    0.104  0.030  0.046
SD      D     0.371  0.069  0.116
        HP    0.138  0.028  0.047

Herrera and Pury used different definitions of recall and precision in their work [13]:

$R_{HP} = \frac{N_c}{N_{rel}},$   (10)

$P_{HP} = \frac{N_{rel}}{N_l}.$   (11)

$N_{rel}$ is the total number of relevant words in the prepared benchmark glossary; here $N_{rel}$ is 29 for the paper and 283 for the book. $N_c$ is the number of relevant words among the first $N_{rel}$ entries of the ordered word list extracted by a metric, and $N_l$ is the position in this list at which the last relevant word appears.
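For comparison, a sketch of the Herrera–Pury definitions in Eqs. (10) and (11), given a full ranked word list and a benchmark glossary (our illustration; the example lists are hypothetical):

```python
def hp_recall_precision(ordered_words, glossary):
    """Herrera-Pury recall and precision, Eqs. (10)-(11)."""
    glossary = set(glossary)
    N_rel = len(glossary)
    # N_c: relevant words among the first N_rel entries of the ordered list.
    N_c = sum(1 for w in ordered_words[:N_rel] if w in glossary)
    # N_l: position (1-based) of the last relevant word in the ordered list.
    N_l = max((i + 1 for i, w in enumerate(ordered_words) if w in glossary), default=0)
    R_hp = N_c / N_rel
    P_hp = N_rel / N_l if N_l else 0.0
    return R_hp, P_hp

ranking = ["entropy", "the", "ranking", "of", "text", "keyword"]
print(hp_recall_precision(ranking, ["entropy", "ranking", "keyword"]))
```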

4. Comparison between word ranking methods

We calculate recall, precision and F-measure by the HP method. Table 1 contains these parameters for the assigned metrics applied to the representative long and short documents. In the long text we have 0.152 ⩽ R_HP ⩽ 0.417, 0.066 ⩽ P_HP ⩽ 0.141 and 0.107 ⩽ F ⩽ 0.164. The maximum values of recall and F-measure belong to E_C, and E_B has the best precision in the long document. On the other hand, the boundary values for the short document are 0.104 ⩽ R_HP ⩽ 0.172, 0.028 ⩽ P_HP ⩽ 0.030 and 0.044 ⩽ F ⩽ 0.049. Both RE_A and E_C have the best values of recall and F-measure, while TextRank has the maximum precision for the short text.

It seems that Eq. (11) is not a proper formula for the precision of an extracted index, because it depends strongly on the position of the last relevant word in the ordered word list. This means that we would obtain a poor precision value for a good retrieved index that has most of the relevant words in its topmost positions and only a few relevant words near its bottom. Therefore, we use Eqs. (7)–(9) for recall, precision and F-measure.

We also introduce a new approach to evaluating extracted indices. In this method, different numbers of words are selected from the beginning of the ordered list as retrieved indices, and recall, precision and F-measure are obtained for each. We then plot these parameters versus the fraction of the ranked list that is retrieved. Fig. 2 illustrates the recall, precision and F-measure plots for E_A as a representative measure; the three parameters behave similarly for the other metrics.

As seen in the precision graph, most of the relevant words are located within the first 20% of entries of the ordered list. Beyond r = 0.2 the precision varies smoothly down to the position of the last keyword, so the slope of the precision graph before r = 0.2 can be used in the evaluation process: the best method has the steepest precision graph. On the other hand, the F-measure has a global maximum. The maximum of the F-measure in Fig. 2 occurs at r = 0.067 for E_A. This means that the best choice of retrieved index is the list comprising the first 6.7% of entries in the list ordered by E_A; in other words, the optimum length of the retrieved index is determined by the maximum of the F-measure graph.
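The proposed evaluation procedure can be sketched as follows: sweep the cutoff fraction r through the ranked list, compute recall, precision and F-measure at each cutoff, and take the r at which F peaks as the optimum index length. The helper below is our illustration, with hypothetical example data:

```python
def optimal_index_size(ordered_words, glossary, steps=100):
    """Sweep cutoff fractions r of the ranked list; return (r, F) at the F-measure peak."""
    glossary = set(glossary)
    n = len(ordered_words)
    best_r, best_f = 0.0, 0.0
    for k in range(1, steps + 1):
        r = k / steps
        retrieved = set(ordered_words[: max(1, int(r * n))])
        hits = len(retrieved & glossary)
        R = hits / len(glossary)
        P = hits / len(retrieved)
        F = 2 * R * P / (R + P) if hits else 0.0
        if F > best_f:
            best_r, best_f = r, F
    return best_r, best_f

ranking = ["entropy", "ranking", "the", "keyword", "of", "text", "a", "index"]
print(optimal_index_size(ranking, ["entropy", "ranking", "keyword", "index"], steps=8))
```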


Fig. 2. Recall (blue diamonds), precision (red squares) and F-measure (green circles) as functions of the fraction of the retrieved list of words. These plots are for the ordered list extracted by E_A applied to the long document; the other metrics yield similar plots. The horizontal axis represents the fraction of the ranking list selected from its beginning as the retrieved index. There is a global maximum in the F-measure plot at r = 0.067, so the first 6.7% of entries should be selected as the retrieved index of the document for the metric E_A. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
Maximum value of the F-measure and its position for E_A, RE_A, E_B, E_C, TextRank and standard deviation for Darwin's book and Herrera and Pury's paper. The values of recall and precision at the position of maximum F-measure are presented too.

Metric  Text  F_max  r      R      P
E_A     D     0.368  0.067  0.279  0.541
        HP    0.214  0.043  0.174  0.276
RE_A    D     0.341  0.081  0.243  0.569
        HP    0.190  0.022  0.212  0.172
E_B     D     0.285  0.074  0.209  0.445
        HP    0.120  0.237  0.067  0.586
E_C     D     0.432  0.046  0.380  0.502
        HP    0.236  0.083  0.156  0.483
TR      D     0.289  0.067  0.219  0.424
        HP    0.118  0.178  0.068  0.448
SD      D     0.412  0.081  0.294  0.689
        HP    0.166  0.029  0.159  0.172

Table 2 lists the maximum value of the F-measure and its position for the mentioned methods applied to the representative long and short documents; the recall and precision at the position of maximum F-measure are also reported. The extreme values of these quantities for the long text are 0.285 ⩽ F_max ⩽ 0.432, 0.046 ⩽ r ⩽ 0.081, 0.209 ⩽ R ⩽ 0.380 and 0.424 ⩽ P ⩽ 0.689. In the long text, the best F-measure, the shortest index and the maximum recall at the optimum size of the retrieved index are produced by E_C, while the highest precision at the optimum index size belongs to standard deviation. On the other hand, the boundary values for the short text are 0.118 ⩽ F_max ⩽ 0.236, 0.022 ⩽ r ⩽ 0.237, 0.067 ⩽ R ⩽ 0.212 and 0.172 ⩽ P ⩽ 0.586. In the short text, the shortest retrieved index and the maximum recall are produced by RE_A, while E_B and E_C have the best precision and F_max, respectively. Overall, the table shows that the entropic metrics work better than the nonentropic ones. We also checked the effect of eliminating functional words on the ranking results: if the functional words are excluded, the results of some methods improve slightly, but the final results reported in the tables do not change.

Running time is an important factor in computational methods; a faster method is preferable to a slower one. We compare the running times of the discussed methods for various fractions of the representative texts. Fig. 3 shows the running time of the methods as a function of the text fraction. TextRank and E_B are slower than the other methods, and the inset of Fig. 3 confirms that E_A, RE_A, E_C and standard deviation have almost the same running time.

5. Relations between methods

We also find significant relations between some of the discussed criteria. It seems that E_A and TextRank use similar features for ranking the words; in other words, we conjecture that both metrics exploit the cycle lengths around terms in the word co-occurrence network, and we therefore expect them to behave similarly. Fig. 4 confirms this conjecture: a clear relation is seen between these two metrics.


Fig. 3. Running time of the discussed metrics versus the volume fraction of the book The Origin of Species. TextRank and E_B are clearly slower than the others. Inset: running time for E_A, RE_A, E_C and standard deviation only. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. TextRank versus E_A for Darwin's book (blue diamonds) and Herrera and Pury's paper (red circles). The two metrics show the same increasing behavior in the word ranking process; this relation may stem from similar ingredients in the two ranking procedures.

Fig. 5. Standard deviation of the distances between word type occurrences as a function of RE_A for Darwin's book (blue diamonds) as a long text and Herrera and Pury's paper (red squares) as a short text. The linear behavior is an interesting phenomenon. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

There is another relation, between standard deviation and RE_A, which is illustrated in Fig. 5: a linear behavior is observed.

6. Conclusion

In conclusion, we have presented entropic metrics for ranking the words in human written texts without any lemmatization; ranking plays a basic role in keyword extraction. We have also proposed a new approach to evaluating the accuracy of retrieved indices, in which we calculate recall, precision, F-measure and the optimum size of the retrieved index. A comparison between the entropic metrics and two other representative measures gives illustrative results. This study shows that our new forms of entropic measures yield reliable results in the word ranking process: the entropic metrics E_A, RE_A, E_B and E_C often score better on the evaluation parameters than the nonentropic measures, TextRank and standard deviation. Consequently, we recommend entropy as a systematic measure in the text mining process.


Acknowledgments

We thank P.A. Pury for generously providing the text file of Darwin's book The Origin of Species and the prepared glossary for it. This work has been supported by the Zanjan University research program A combined method for keyword extraction, No. P/6295.

Appendix. Nonentropic word ranking methods

A.1. Standard deviation

The standard deviation of the distances between nearest neighboring occurrences of the same term is one of the first methods presented for ranking the words of a text [9]. If we denote the distance between consecutive occurrences and its average by d and $\mu = \langle d\rangle$ respectively, the standard deviation of the normalized distance between occurrences is $SD = \sqrt{\langle s^2\rangle - \langle s\rangle^2}$, where $s = d/\mu$ is the normalized distance between consecutive occurrences. We normalize this quantity by the standard deviation of an ideal random text, which has a geometric probability distribution with $SD_{geo} = \sqrt{1 - p}$ [13]:

$SD_{nor} = \frac{SD}{SD_{geo}}.$   (12)
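A minimal sketch of Eq. (12), assuming a list of occurrence positions (open spacings, no circular closure) and taking p = M/N as the word's occurrence probability in the geometric null model; this is our illustration, not the original implementation:

```python
import math

def sd_normalized(positions, text_length):
    """Normalized standard deviation of inter-occurrence distances, Eq. (12)."""
    M = len(positions)
    if M < 2:
        return 0.0
    d = [positions[i + 1] - positions[i] for i in range(M - 1)]
    mu = sum(d) / len(d)                        # mean distance
    s = [di / mu for di in d]                   # normalized distances s = d / mu
    var = sum(si * si for si in s) / len(s) - (sum(s) / len(s)) ** 2
    sd = math.sqrt(max(var, 0.0))               # guard against tiny negative round-off
    p = M / text_length                         # occurrence probability of the word
    sd_geo = math.sqrt(1.0 - p)                 # SD of an ideal (geometric) random text
    return sd / sd_geo

print(sd_normalized([3, 5, 8, 60, 62, 65], 100))  # clustered words give large values
```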

A.2. TextRank

Each text can be represented by a graph: the word types are the graph nodes, connected by weighted or unweighted links according to their co-occurrences. The adjacency matrix A of the text graph contains all the information about the text:

$A_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are adjacent,} \\ 0 & \text{otherwise.} \end{cases}$

The eigenvector of the adjacency matrix determines the centrality of a node (word type). Mihalcea and Tarau introduced TextRank as a generalized version of centrality for ranking the words in a given text [12]:

$TR(\omega_i) = (1 - d) + d \sum_{\omega_j \in In(\omega_i)} \frac{TR(\omega_j)}{|Out(\omega_j)|}.$   (13)

$TR(\omega_i)$ denotes the TextRank value of the ith node, $In(\omega_i)$ ($Out(\omega_i)$) is the set of nodes that point to (are pointed to by) $\omega_i$, and $|Out(\omega_i)|$ is the outdegree of the ith node. The damping parameter $0 \leq d \leq 1$ is usually set to 0.85. The above equation can be solved by an iterative routine such as the Gauss–Seidel method.
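A minimal sketch of the TextRank iteration in Eq. (13) on an unweighted, undirected co-occurrence graph (so In and Out coincide), iterated in Jacobi fashion rather than Gauss–Seidel; the graph and names are hypothetical:

```python
def textrank(neighbors, d=0.85, tol=1e-6, max_iter=100):
    """TextRank scores, Eq. (13), for an undirected co-occurrence graph.

    neighbors: dict mapping each word to the list of words it co-occurs with.
    """
    tr = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_tr = {}
        for wi in neighbors:
            # Sum contributions from neighbors, each divided by its degree.
            s = sum(tr[wj] / len(neighbors[wj]) for wj in neighbors[wi] if neighbors[wj])
            new_tr[wi] = (1.0 - d) + d * s
        converged = max(abs(new_tr[w] - tr[w]) for w in tr) < tol
        tr = new_tr
        if converged:
            break
    return tr

# Toy co-occurrence graph: "entropy" is the most connected word.
graph = {
    "entropy": ["word", "ranking", "text"],
    "word": ["entropy", "ranking"],
    "ranking": ["entropy", "word"],
    "text": ["entropy"],
}
print(textrank(graph))
```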

References

[1] J.W. Gibbs, Elementary Principles in Statistical Mechanics, Longmans, Green and Company, New York, 1928.
[2] C.E. Shannon, Bell Syst. Tech. J. 27 (1948) 379.
[3] M. Mezard, A. Montanari, Information, Physics and Computation, Oxford University Press, Oxford, 2009.
[4] G.K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley, Cambridge, 1949.
[5] R. Ferrer i Cancho, R.V. Solé, Proc. R. Soc. Lond. B 268 (2001) 2261.
[6] S.N. Dorogovtsev, J.F.F. Mendes, Proc. R. Soc. Lond. B 268 (2001) 2603.
[7] M.W. Berry, J. Kogan, Text Mining: Applications and Theory, Wiley, New York, 2010.
[8] H.P. Luhn, IBM J. Res. Dev. 2 (1958) 159.
[9] M. Ortuño, et al., Europhys. Lett. 57 (2002) 759.
[10] H. Zhou, G.W. Slater, Physica A 329 (2003) 309.
[11] P. Carpena, et al., Phys. Rev. E 79 (2009) 035102(R).
[12] R. Mihalcea, in: A. Gelbukh (Ed.), Random Walks on Text Structures, in: LNCS, CICLing, Springer, Heidelberg, 2006, pp. 249–262.
[13] J.P. Herrera, P.A. Pury, Eur. Phys. J. B 63 (2008) 135.
[14] D.L. Olson, D. Delen, Advanced Data Mining Techniques, Springer-Verlag, Berlin, 2008.