[IEEE 2009 International Conference on Information Engineering and Computer Science - Wuhan, China (2009.12.19-2009.12.20)] 2009 International Conference on Information Engineering

New Method of Character Relations Auto-extraction

Feng Yang, Luo Sen-Lin, Pan Li-Min, Wang Yue, Wang Yao-Wei

Lab of Information Security & Countermeasures Technology, School of Information & Electronics of BIT, Beijing, P. R. China


Abstract—This paper proposes a new method for automatically extracting character relations from a Chinese text set. Characters and their relations are extracted from sentences and documents by computing the relative strength between characters and the attributive weight of each relative property. The dispersed relations are organized and represented by relative coefficients. To exhibit the relations visually, a relations map, which describes the characters and the relative properties between them, can be constructed from the coefficients. In the relation extraction experiment, the method achieves a precision of 85.56% and a recall of 61.94%; in the relations map experiment, precision and recall are 94.74% and 52.94% at level 1, and 96.88% and 59.62% at level 2.

Keywords- Chinese Information Processing; character relation; relative strength; relative property; relations map

I. INTRODUCTION

As a carrier that records human social activities, text has the important function of describing character relations. However, those relations are dispersed across texts and are very difficult to extract manually. A method is therefore needed to extract them from a text set automatically, quickly and accurately. Such a method can answer questions like “Who is Liu Xiang’s coach?”, “Who is George Bush’s wife?” and “Are Yao Ming and McGrady competitors or teammates?”

Several useful and effective methods for recognizing and extracting relations have been proposed in past research. In [1], two machine learning algorithms, Winnow and SVM, are used to extract entity relations from sentences; [2] introduces SVM training with positive and negative cases into entity relation extraction; [3] proposes an entity relation extraction method based on seed self-expansion; in [4], word semantic matching is introduced into pattern matching to extract Chinese entity relations. These methods target entity relations within a sentence, and they need templates trained on tagged data. [5] proposes a method that uses simulated annealing to mine character relations from web pages. Besides these, Object-level Vertical Search [6][7][8] is also a kind of relation extraction technology. In this approach, all entities in web pages are defined as objects. Correlative information from titles, sentences, tables, descriptions and so on is extracted to construct the objects. After that, the correlation between objects is calculated statistically, and the properties of the relations are extracted by matching relative words.

However, these methods focus on extracting relations from sentences or web pages, and they depend on templates. None of them organizes the character relations found in a plain text set into a map.

In this paper, we propose a new method for automatically extracting character relations from a Chinese text set. The method extracts and organizes the character relations that are dispersed in the text set. Moreover, we extract the characters that are close in relation to a central character, and construct a visual character relations map that describes the relations and the relative properties between them.

II. THE METHOD OF CHARACTER RELATIONS EXTRACTION

A. Definition of the Problem

We are concerned with two main aspects of character relations in the text set: one is the relative property and the other is the relative strength.

Relative Property (RP): it is the type of relation between two characters (such as father-son, mother-son, brotherhood, master-apprentice and so on).

The relative property is reflected by the relative word in the sentence. For example, the words “coach” and “apprentice” reflect the master-apprentice relation between Liu Xiang and Sun Haiping, and the word “wife” reflects the husband-wife relation between Yao Ming and Ye Li. These bidirectional descriptions, such as master-apprentice and husband-wife, are relative properties.

Relative Strength (RS): it is the degree of relation between two characters in the text set.

The relative strength measures how extensively the relation is described in the text set. For example, if a text set describes the relation between Liu Xiang and Sun Haiping many times but mentions the relation between Liu Xiang and Yao Ming only once, the relative strength of the former is higher than that of the latter.

Supported by National 242 Project (No. 2005C48), Basic Research Foundation of Beijing Institute of Technology (No. 20060142014) and Graduated Student Science & Technology Creative Project of Beijing Institute of Technology (No. GC200802).

978-1-4244-4994-1/09/$25.00 ©2009 IEEE

Based on the definitions above, the problem of character relations extraction can be decomposed into the following three steps:

(1) Extracting characters and computing relative strength for each pair of them.

(2) Recognizing the relative properties for each pair of characters.

(3) Organizing all character pairs according to their relations, and constructing the character relations map.

B. Extracting Characters and Computing Relative Strength

Because there is no separator between words in a Chinese sentence, the first step must be word segmentation and POS (Part of Speech) tagging. In this paper, we use an improved ICTCLAS [9] (Institute of Computing Technology, Chinese Lexical Analysis System) for this task. ICTCLAS can recognize personal names; its precision for personal name recognition was 98% in the 973 project evaluation. This function is used to recognize characters in sentences.

In a text set, relative strength depends on two kinds of co-occurrence information for characters: co-occurrence in a sentence and co-occurrence in a text. Both must be considered together when computing relative strength.

Characters that appear in the same sentence are correlated, whether the sentence describes a relation, a state or an action. Characters’ co-occurrence in a sentence is quantified by

$$L_{ij} = \frac{1}{M} \sum_{m=1}^{M} \frac{p_{mi} \times p_{mj}}{\sum_{k} p_{mk}^{2}} \tag{1}$$

where $L_{ij}$ is the strength of co-occurrence in sentences between characters $i$ and $j$, named the local gene; $p_{mi}$ and $p_{mj}$ are the appearance frequencies of characters $i$ and $j$ in sentence $m$; $\sum_{k} p_{mk}^{2}$ is the normalization parameter, which avoids the influence caused by the length of the sentence; and $M$ is the number of sentences in the text.

Characters’ co-occurrence in a text is quantified as

$$G_{ij} = \frac{q_{i} \times q_{j}}{\sum_{k} q_{k}^{2}} \tag{2}$$

where $G_{ij}$ is the strength of co-occurrence in the text between characters $i$ and $j$, named the global gene; $q_{i}$ and $q_{j}$ are the appearance frequencies of characters $i$ and $j$ in the text; and $\sum_{k} q_{k}^{2}$ is the normalization parameter, which avoids the influence caused by the length of the text.

The local gene and the global gene are used to calculate the relative strength between two characters, as

$$rs_{ij} = \sum_{d=1}^{N} \left[ L_{ij}(d) + G_{ij}(d) \right] \tag{3}$$

where $N$ is the number of texts in the text set.
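Formulas (1)-(3) can be illustrated with a minimal sketch. It assumes that each text has already been split into sentences and that each sentence has been reduced to the list of personal names recognized in it; the `texts` structure and the function names `local_gene`, `global_gene` and `relative_strength` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def local_gene(sentences, i, j):
    """Formula (1): mean over sentences of p_mi * p_mj / sum_k p_mk^2."""
    if not sentences:
        return 0.0
    total = 0.0
    for names in sentences:
        p = Counter(names)                      # p_mk for every character k
        norm = sum(c * c for c in p.values())   # sum_k p_mk^2
        if norm:
            total += p[i] * p[j] / norm
    return total / len(sentences)


def global_gene(sentences, i, j):
    """Formula (2): q_i * q_j / sum_k q_k^2 over the whole text."""
    q = Counter(name for names in sentences for name in names)
    norm = sum(c * c for c in q.values())
    return q[i] * q[j] / norm if norm else 0.0


def relative_strength(texts, i, j):
    """Formula (3): sum over texts d of L_ij(d) + G_ij(d)."""
    return sum(local_gene(s, i, j) + global_gene(s, i, j) for s in texts)


# Toy text set (assumed data): two texts, each a list of sentences,
# each sentence a list of the character names found in it.
texts = [
    [["Liu Xiang", "Sun Haiping"], ["Liu Xiang"]],
    [["Liu Xiang", "Yao Ming"], ["Sun Haiping", "Liu Xiang"]],
]
print(relative_strength(texts, "Liu Xiang", "Sun Haiping"))
print(relative_strength(texts, "Liu Xiang", "Yao Ming"))
```

In this toy data the pair Liu Xiang and Sun Haiping co-occurs in both texts, so its relative strength comes out higher than that of Liu Xiang and Yao Ming, which matches the intuition behind the definition of relative strength.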

C. Recognizing Relative Properties

We use a relation dictionary to map relative words to bidirectional relative properties, which describe the type of relation between topical characters (the described characters) and comment characters (the characters in the descriptive segment). TABLE I gives a demonstration of the dictionary.

TABLE I. DEMONSTRATION OF RELATION DICTIONARY

| Relative Word | Relative Property | Relative Word | Relative Property |
|---|---|---|---|
| (father) | (father-son), (father-daughter) | (wife) | (husband-wife) |
| (dad) | (father-son), (father-daughter) | (boyfriend) | (sweetheart) |
| (mother) | (mother-son), (mother-daughter) | (girlfriend) | (sweetheart) |
| (mama) | (mother-son), (mother-daughter) | (coach) | (master-apprentice) |
| (son) | (father-son), (mother-son) | (teacher) | (master-apprentice) |
| (daughter) | (father-daughter), (mother-daughter) | (student) | (master-apprentice) |
| (husband) | (husband-wife) | (apprentice) | (master-apprentice) |

Because relative words are nouns, which form an open set, the dictionary is difficult to make complete. However, users can build a limited relation dictionary like TABLE I to indicate the properties they are interested in and map the relative words to them. The limited relation dictionary is the basis of relative property extraction. Words tagged as nouns in lexical analysis are matched against the dictionary, and matching words are recognized as relative words. The relative properties of a relative word are then looked up in the dictionary directly.
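The dictionary lookup described above can be sketched as follows. The dictionary here is keyed by the English glosses from TABLE I rather than by the original Chinese words, and the tag names `"n"` (noun) and `"nr"` (personal name), the `RELATION_DICT` name and the `find_relative_words` function are assumptions made for illustration.

```python
# Hypothetical relation dictionary, keyed by English glosses of relative words.
RELATION_DICT = {
    "father": ["father-son", "father-daughter"],
    "wife": ["husband-wife"],
    "coach": ["master-apprentice"],
    "son": ["father-son", "mother-son"],
}


def find_relative_words(tagged_sentence):
    """Return (position, word, candidate properties) for each dictionary noun.

    tagged_sentence: list of (word, pos_tag) pairs from the lexical analyzer.
    """
    hits = []
    for position, (word, tag) in enumerate(tagged_sentence):
        # Only nouns are matched against the dictionary.
        if tag == "n" and word in RELATION_DICT:
            hits.append((position, word, RELATION_DICT[word]))
    return hits


sentence = [
    ("Sun Haiping", "nr"), ("is", "v"), ("Liu Xiang", "nr"),
    ("'s", "u"), ("coach", "n"),
]
print(find_relative_words(sentence))
```

A word such as "son" returns two candidate properties; choosing between them is left to the weighting step described later in this section.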

After the relative properties have been extracted from a sentence, their ownership must be determined, because one sentence may contain multiple characters and multiple relative words. For example, in a single sentence the relative word “wife” may reflect the husband-wife relation between Yao Ming and Ye Li, while the relative word “dad” reflects the father-son relation between Yao Ming and Yao Zhiyuan.

By manually analyzing more than 9700 sentences, we find that the relative word belongs to the nearest two characters in the overwhelming majority of cases, although not always. Following this principle, we use the distance between the relative word and the character pair as the discriminating feature, and use the 9700 sentences to train a Bayesian classifier that estimates the ownership of relative words.
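The nearest-two-characters principle on its own can be sketched as below. This is only the plain distance heuristic, not the Bayesian classifier used in the paper, and the token positions, the tie-breaking rule and the function name `nearest_pair` are assumptions.

```python
def nearest_pair(characters, word_position):
    """Return the two characters closest to the relative word, in sentence order.

    characters: list of (name, token_position) tuples for one sentence.
    word_position: token position of the relative word.
    """
    if len(characters) < 2:
        return None
    # Sort by distance to the relative word; ties keep their sentence order.
    ranked = sorted(characters, key=lambda c: abs(c[1] - word_position))
    closest = sorted(ranked[:2], key=lambda c: c[1])
    return closest[0][0], closest[1][0]


# Assumed token positions for a sentence with three characters and the
# relative word "wife" at position 2.
characters = [("Yao Ming", 0), ("Ye Li", 3), ("Liu Xiang", 5)]
print(nearest_pair(characters, word_position=2))
```

A classifier trained on the distance feature can override this rule in the minority of sentences where the nearest pair is not the owner of the relative word.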


Besides determining ownership, there is another problem: the relative property of two characters may differ between sentences. For example, two different relative properties, master-apprentice and father-son, may be extracted for Liu Xiang and Sun Haiping from two different sentences, but only one relative property should be assigned. Moreover, relative words that map to multiple relative properties can also cause confusion.

To solve this problem, we consider how likely each relative property between two characters is across the text set. The weight of each relative property is calculated from its statistical frequency, and the property with the largest weight is taken as the correct property between the two characters. The weight is calculated as

$$w_{i} = \sum_{d=1}^{N} w_{d}(i), \qquad w_{d}(i) = \frac{r_{i}}{M} \tag{4}$$

where $w_{d}(i)$ is the weight of relative property $i$ in text $d$, $N$ is the number of texts in the text set, $r_{i}$ is the number of times property $i$ appears in the text, and $M$ is the number of sentences in the text.
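As a minimal sketch of formula (4), assume that for one character pair each text is summarized as the list of properties observed for that pair together with the number of sentences in the text. The input layout and the names `property_weights` and `most_likely_property` are hypothetical.

```python
from collections import defaultdict


def property_weights(texts):
    """Formula (4): w_i = sum over texts d of r_i / M.

    texts: list of (observed_properties, sentence_count) tuples, one per text,
    where observed_properties lists every extracted occurrence of a property.
    """
    weights = defaultdict(float)
    for observed, sentence_count in texts:
        if sentence_count == 0:
            continue
        for prop in observed:
            # Each occurrence contributes 1/M, so a property seen r_i times
            # in this text contributes r_i / M in total.
            weights[prop] += 1.0 / sentence_count
    return dict(weights)


def most_likely_property(texts):
    """Return the property with the largest weight, or None if none was seen."""
    weights = property_weights(texts)
    return max(weights, key=weights.get) if weights else None


# Assumed observations for one pair across two texts.
observations = [
    (["master-apprentice", "master-apprentice", "father-son"], 10),
    (["master-apprentice"], 5),
]
print(property_weights(observations))
print(most_likely_property(observations))
```

Here master-apprentice receives 2/10 + 1/5 and father-son receives 1/10, so the occasional conflicting extraction is outvoted.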

D. Organizing Relations and Constructing Relation Map

We use a group of relative coefficients (each including a relative strength and an array of relative properties) to represent the relations that are dispersed in the text set, as

$$RC_{ij} = \langle rs_{ij}, RP_{ij} \rangle \tag{5}$$

where $rs_{ij}$ and $RP_{ij}$ denote the relative strength and the array of relative properties between characters $i$ and $j$.

The relative strength of the coefficient is a scalar, calculated using formulas (1)-(3). The relative properties array records all relative properties between the two characters and is described as

$$RP_{ij} = \left[ (rp_{1}, w_{1}), (rp_{2}, w_{2}), \ldots, (rp_{m}, w_{m}) \right] \tag{6}$$

where $rp_{i}$ denotes a kind of relative property and $w_{i}$ is the attributive weight of $rp_{i}$ in the text set, calculated as in (4).
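One possible in-memory representation of the relative coefficient in (5) and (6) is sketched below. The class name `RelativeCoefficient`, its field names and the `best_property` helper are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RelativeCoefficient:
    """RC_ij = <rs_ij, RP_ij> for one character pair."""

    rs: float                                                     # relative strength rs_ij
    rp: List[Tuple[str, float]] = field(default_factory=list)    # [(rp_k, w_k), ...]

    def best_property(self) -> Optional[str]:
        """Return the property with the largest weight, or None if rp is empty."""
        if not self.rp:
            return None
        return max(self.rp, key=lambda item: item[1])[0]


# Illustrative values only.
rc = RelativeCoefficient(
    rs=1.23,
    rp=[("master-apprentice", 0.4), ("father-son", 0.1)],
)
print(rc.best_property())
```

Keeping the full array of weighted properties, rather than only the winner, means the most probable property can be re-derived whenever new texts change the weights.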

To exhibit relations visually, we construct the relations map from the relative coefficients. A relations map is a star-shaped graph that describes the character relations in a text set. It contains three kinds of elements: characters, relation lines and relative properties. A central character sits in the middle of the map; the characters that are close to it in relation are placed around it and connected to it by relation lines. Each relation line is tagged with its relative property. The map can be unfolded at multiple levels. Fig. 1 is a sample character relations map constructed by the method proposed in this paper. Graph (a) of Fig. 1 is the map with Yao Ming as the central character, unfolded at level 1; graph (b) is the same map partly unfolded at level 2.

Figure 1. Sample of Relation Map: (a) unfolded at level 1; (b) unfolded at level 2

To filter out the characters that are distant from the central character in relation, a threshold is used as the filtering standard. The threshold is calculated by

$$\tau_{i} = \frac{1}{H} \sum_{j=1}^{H} rs_{ij} + \sqrt{ \frac{1}{H} \sum_{j=1}^{H} \left( rs_{ij} - \frac{1}{H} \sum_{k=1}^{H} rs_{ik} \right)^{2} } \tag{7}$$

where $H$ is the number of characters related to character $i$ in the text set. Treating all relative strengths of character $i$ as the sequence $\{ rs_{ij} \mid j = 1, 2, \ldots, H \}$, formula (7) states that the relative strength threshold equals the mean of the sequence plus its standard deviation, so that the threshold accounts for the fluctuation range of the sequence.
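The threshold can be sketched in a few lines. The sketch reads the dispersion term in formula (7) as the population standard deviation (dividing by $H$), which is an interpretation of the text, and it keeps characters whose strength is at or above the threshold; the comparison direction at the boundary and the function names are assumptions.

```python
from statistics import mean, pstdev


def strength_threshold(strengths):
    """Formula (7): mean of rs_ij over j plus the population standard deviation."""
    return mean(strengths) + pstdev(strengths)


def filter_neighbours(strengths_by_name):
    """Keep the characters whose relative strength reaches the threshold."""
    tau = strength_threshold(list(strengths_by_name.values()))
    return {name: rs for name, rs in strengths_by_name.items() if rs >= tau}


# Illustrative strengths between a central character and four others.
strengths = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 10.0}
print(strength_threshold(list(strengths.values())))
print(filter_neighbours(strengths))
```

Because the threshold sits one standard deviation above the mean, only characters whose strength stands clearly out from the rest remain around the central character.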

In the relations map, the relative property shown between two characters is the most probable one, that is, the one with the largest weight. It can be read from the relative coefficient directly.
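Unfolding the star-shaped map level by level can be sketched as a breadth-first traversal. The sketch assumes that thresholding has already been applied and that each character maps to a list of (related character, most probable property) pairs; the function name `build_relations_map`, the input layout and the placeholder "Character A" are hypothetical.

```python
def build_relations_map(center, neighbours, levels=2):
    """Unfold a relations map around `center` up to the given level.

    neighbours: dict mapping a character to a list of
    (other_character, relative_property) pairs that passed the threshold.
    Returns a list of (level, source, target, relative_property) edges.
    """
    edges = []
    seen = {center}
    frontier = [center]
    for level in range(1, levels + 1):
        next_frontier = []
        for source in frontier:
            for target, prop in neighbours.get(source, []):
                if target in seen:
                    continue  # each character is drawn once in the map
                seen.add(target)
                edges.append((level, source, target, prop))
                next_frontier.append(target)
        frontier = next_frontier
    return edges


neighbours = {
    "Yao Ming": [("Ye Li", "husband-wife"), ("Yao Zhiyuan", "father-son")],
    "Ye Li": [("Yao Ming", "husband-wife"), ("Character A", "master-apprentice")],
}
for edge in build_relations_map("Yao Ming", neighbours, levels=2):
    print(edge)
```

With `levels=1` only the two edges leaving the central character are produced; `levels=2` additionally unfolds the neighbours of each level-1 character.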

III. EXPERIMENTS AND RESULTS

A. Experiment of Relations Extraction

The text data used in this experiment was collected from the Internet. Using the keywords Yao Ming, Liu Xiang, Zhou Jielun, Zhou Xingchi and Kobe (all popular characters in web search), 1348 texts were downloaded from news websites. The distribution of the text sets is shown in TABLE II. The number of characters in these text sets was counted manually.

We use the proposed method to extract the character relations from each text set and to construct the coefficient for each character pair automatically. By comparing the automatically extracted results with TABLE II, we calculate the precision, recall and F-Score. The experimental results are given in TABLE III.

TABLE II. TEXT SET OF THE EXPERIMENT

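| Text Set | Searching Keyword | Number of Texts | Number of Characters | Number of Relative Properties |
|---|---|---|---|---|
| Set 1 | (Yao Ming) | 405 | 509 | 94 |
| Set 2 | | 326 | 226 | 73 |
| Set 3 | | 207 | 214 | 82 |
| Set 4 | | 210 | 204 | 62 |
| Set 5 | | 200 | 463 | 102 |
| Total | | 1348 | 1616 | 413 |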

TABLE III. EXPERIMENTAL RESULTS OF RELATIONS EXTRACTION

| Text Set | Precision | Recall | F-Score |
|---|---|---|---|
| Set 1 | 85.71% | 57.44% | 68.78% |
| Set 2 | 94.34% | 68.49% | 79.36% |
| Set 3 | 85.55% | 57.31% | 68.64% |
| Set 4 | 82.22% | 59.68% | 69.16% |
| Set 5 | 80.00% | 66.77% | 72.79% |
| Average | 85.56% | 61.94% | 71.86% |

The experimental results show that the precision and recall of the relative properties auto-extraction reach 85.56% and 61.94% respectively.
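The F-Score values in TABLE III are consistent with the usual harmonic mean of precision and recall. Assuming that definition (the paper does not state the formula explicitly), the column can be reproduced as follows; the function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(85.71, 57.44), 2))  # Set 1 row
print(round(f_score(85.56, 61.94), 2))  # Average row
```

Under this assumption the F-Score in the Average row is obtained from the averaged precision and recall rather than by averaging the five per-set F-Scores.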

B. Experiment of Relation Map

In this experiment, we randomly choose 100 to 400 texts from the first text set above, whose searching keyword is Yao Ming, to construct 4 new sets, and use the method proposed in this paper to extract the relations map from each one. Every map can be unfolded at 2 levels. The experimental results are given in TABLE IV.

TABLE IV. EXPERIMENTAL RESULTS OF RELATION MAP

| Number of Texts | Level 1 Precision | Level 1 Recall | Level 1 F-Score | Level 2 Precision | Level 2 Recall | Level 2 F-Score |
|---|---|---|---|---|---|---|
| 100 | 60.00% | 15.79% | 25.00% | 83.33% | 26.32% | 40.00% |
| 200 | 71.43% | 35.71% | 47.62% | 87.50% | 36.84% | 51.85% |
| 300 | 75.00% | 50.00% | 60.00% | 92.31% | 57.14% | 70.59% |
| 400 | 94.74% | 52.94% | 67.92% | 96.88% | 59.62% | 73.81% |

TABLE IV shows that, with 400 texts, the precisions at level 1 and level 2 are 94.74% and 96.88% respectively, and the recalls are 52.94% and 59.62% respectively. The curves of precision, recall and F-Score are shown in Fig. 2.

Figure 2. Curves of Precision, Recall and F-Score

Fig. 2 shows that precision and recall grow as the number of texts increases, at both level 1 and level 2. This indicates that the more texts about the central character the text set contains, the better the accuracy and comprehensiveness of the map.

IV. CONCLUSIONS

This paper proposed a method for automatically extracting character relations from a Chinese text set. In this method, character relations that are dispersed in the text set are extracted and organized automatically, and a relations map, which describes how close the characters are in relation and the relative properties between them, can be constructed. Because a relation dictionary is used, single-directional relative words are mapped to bidirectional properties, so the ambiguities caused by different word forms reflecting the same relative property are avoided. The experimental results demonstrate the effectiveness of this method and show that, as the number of texts in the text set increases, relative property extraction becomes more exact and more comprehensive.

Several directions remain for future research: (1) besides the relation dictionary, pattern recognition methods could be used to recognize the relative property; (2) more features, such as POS, predicates and syntax type, could be used to judge the ownership of relative properties; (3) the precision and recall of the relations map increase with the number of texts, but not indefinitely: they converge to limits. Methods of syntactic and semantic analysis could be introduced into the relations extraction to raise these limits and make the extraction more exact.

REFERENCES

[1] W. X. Che, T. Liu, S. Li, “Automatic entity relation extraction,” Journal of Chinese Information Processing, vol. 19, no. 2, pp. 1-6, 2004 (in Chinese)

[2] L. Liu, B. C. Li, and X. F. Zhang, “Named entity relation extraction based on SVM training by positive and negative cases,” Computer Applications, vol. 28, no. 6, pp. 1444-1446, 2008 (in Chinese)

[3] T. He, C. Xu, and J. Li et al, “Named entity relation extraction method based on seed self-expansion,” Computer Engineering, vol. 32, no. 21, pp. 183-184, 2006 (in Chinese)

[4] B. Deng, X. Z. Fan, and L. G. Yang, “Entity relation extraction method using semantic pattern,” Computer Engineering, vol. 33, no. 10, pp. 212-214, 2007 (in Chinese)

[5] C. L. Yao, and N. Di, “Solution to large scale extraction of social relations of characters based on web,” Pattern Recognition and Artificial Intelligence, vol. 20, no. 6, pp. 740-744, 2007 (in Chinese)

[6] Y. Q. Dong, Q. Z. Li, and H. Li et al, “Building web domain data integration system with user collaboration,” In Proceedings of the 12th International Conference on Computer Supported Cooperative Work in Design, pp. 47-51, 2008

[7] Z. Q. Nie, J. R. Wen, and W. Y. Ma, “Object-level vertical search,” In Proceedings of 3rd Biennial Conference on Innovative Data Systems Research, pp. 235-246, 2007

[8] Z. Q. Nie, Y. X. Ma, and S. M. Shi et al, “Web object retrieval,” In Proceedings of the 16th International Conference on World Wide Web, pp. 81-92, 2007

[9] H. P. Zhang, H. K. Yu, and D. Y. Xiong et al, “HHMM-based Chinese lexical analyzer ICTCLAS,” In Proceedings of 2nd SIGHAN Workshop on Chinese Language Processing, pp. 184-187, 2003
