
[IEEE 2009 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE) - Dalian, China (2009.09.24-2009.09.27)] 2009 International Conference on Natural


Hyponymy Acquisition from Chinese Text by SVM

Fang TIAN

Faculty of Engineering,

The University of Tokushima

Tokushima, Japan

[email protected]

Fuji REN

1, Faculty of Engineering,

The University of Tokushima

Tokushima, Japan

2, School of Information Engineering,

Beijing University of Posts and Telecommunications

Beijing, China

[email protected]

Abstract:

Hyponymy, one of the taxonomic semantic relations, provides fundamental knowledge for natural language processing applications. In this paper, we propose a method for automatically learning hyponymy terms from Chinese text with a machine learning technique. Our method relies on hand-crafted hyponymy patterns and uses syntactic features to build a multi-class classifier, trained with SVM, that identifies novel hyponymy pairs (hyponym/hypernym or hypernym/hyponym) in a sentence. Experimental results show that the method is effective in acquiring hyponymy from Chinese free text.

Keywords:

Hyponymy; hypernym/hyponym; SVM

1. Introduction

Following the definition of hyponymy (is-a) for WordNet by Miller [1], a term X is a hyponym of a term Y if X is an instance or subclass of Y. For example, “Yellow River” is a hyponym of “River”, “China” is a hyponym of “Country”, and conversely “Country” is a hypernym of “China”.

Hyponymy is a basic way of describing semantic relationships and provides fundamental knowledge for organizing information, as in ontology learning [2]. As taxonomic information about concepts, hyponymy has been widely used in Natural Language Processing (NLP), for example in named entity recognition [3]. Take university names as an example: in Chinese, universities are usually named by the pattern “<location name>+[disciplines|industry]+“大学”|“学院” (university|college)”, in which the “location name” and “disciplines/industry” parts belong to different hypernyms. Matching examples include “北京大学” (“Peking University”) and “北京理工大学” (“Beijing Institute of Technology”).

978-1-4244-4538-7/09/$25.00 ©2009 IEEE

Collecting hyponym/hypernym pairs manually is time-consuming and expensive. Consequently, there has been research on methods for acquiring hyponymy automatically [4][6]. There are presently two basic approaches to extracting hyponym/hypernym pairs from free text. One is the pattern-based approach, which first constructs hyponymy patterns from linguistic knowledge and the results of natural language processing such as lexical or syntactic parsing, and then finds hyponym/hypernym relations by pattern matching or by relation classification using the patterns as features [4][12]. The other is the statistical approach, which computes the correlation of concepts on the basis of a corpus and a statistical language model, and then acquires hyponym/hypernym relations automatically [5][13].

In automatic hyponymy acquisition, hyponymy has been identified by hyponymy-only classifiers or by hybrid hyponymy-coordinate classifiers based on machine learning [4][6][14][15]. In early work [4], the ordered noun pair (China, Country) and the ordered noun pair (Country, China) are both judged to be hyponymy by a classifier based on hyponymy patterns. For example, the noun pair (China, Country) or (Country, China) may match a pattern such as “<NPY such as NPX>” or “<NPX is a NPY>” (NP: noun phrase), or another hyponymy pattern within a sentence. Although this approach has produced good results for hyponymy acquisition from English text, those results cannot be applied directly to Chinese text because of differences in language usage and characteristics.

In this paper, we propose an automatic method for acquiring hyponymy terms from Chinese text. We build a multi-class classifier for the hyponymy relation and the coordinate relation based on the support vector machine (SVM). To describe hyponymy exactly, the classifier divides hyponymy into hyponym/hypernym and hypernym/hyponym. We first find regular-expression hyponymy patterns, then define features based on the syntactic features of the patterns, and finally identify novel hyponymy pairs in a sentence with SVM based on these features.

2. Related Works

The data resources for extracting hyponymy include unstructured free text [4][7][8][10] and structured text, such as HTML [9] and the catalog description format used in Wikipedia [6]. Since free text is the main way of describing information, in this paper we focus on hyponymy extraction from free text.

Much of the previous work [4][6][9][10][12] on hyponymy acquisition from free text is based on the lexico-syntactic patterns of Hearst [7]. Those patterns summarize the most common ways of expressing hyponymy between two NPs in English, for example “NP {,} including {NP,}* {or | and} NP”.

Hearst-style patterns have been applied to acquiring hyponymy in various languages. Hyponymy acquisition for Chinese has also been studied, and lexico-syntactic patterns such as “<NPX><是|为>一<量词><NPY>” (NPX is a NPY; “是|为” (is), “一” (a/an), “量词” (quantifier)) were proposed in [10]. While these patterns have been successful at identifying some hyponymy in Chinese, the method is limited by its small number of patterns and depends on a semi-automatically constructed dictionary that is used as a word filter.

Several other pieces of work exploit commonality property nouns and domain verbs [8]. This method can identify the hyponymy relation, but it is limited by its dependence on domain knowledge.

Among the hyponymy extraction work based on machine learning [4][6], Sumida et al. acquired hyponymy relations automatically with features that depend on the hierarchical layouts of Wikipedia [6]. For hyponymy acquisition from free text, Snow et al. [4] used hyponym/hypernym pairs from WordNet as seeds to learn lexico-syntactic patterns represented as dependency paths, and then used the acquired patterns as classifier features to determine whether or not a new noun pair is in a hyponym/hypernym relation. At the same time, according to the results of [11], coordination information can improve the precision of hyponymy extraction. Snow's method was therefore combined with a hypernym-coordinate hybrid classifier to improve hyponymy extraction [4]. Although Snow et al. [4] proposed four labelers for their classifier, they did not take the difference between hyponym/hypernym and hypernym/hyponym into account when defining pattern features.

To identify hyponymy distinctly as hyponym/hypernym or hypernym/hyponym, we propose a multi-class classifier, based on SVM, for hyponymy (including hyponym/hypernym and hypernym/hyponym) and the coordinate relation in Chinese.

3. Acquisition Algorithm

We propose a method of hyponymy acquisition for Chinese that uses an SVM classifier based on hand-crafted patterns. The classifier has four classes: 1-hypernym/hyponym, 2-hyponym/hypernym, 3-coordinate and 4-others. We divide the original hyponymy relation into the hyponym/hypernym relation and the hypernym/hyponym relation. Such fine-grained relations are more effective in describing hyponymy uniformly. For the coordinate relation, we assume that there is no overlap between hyponymy pairs and coordinate pairs. The coordinate relation follows the definition in WordNet [1]: coordinate terms are terms that share the same hypernym as the search string.

3.1 Building Corpus

We collected free texts from the Internet and then preprocessed them to build the corpus. The procedure is as follows:

Step 1: Sentence segmenting: divide the text into sentence units;

Step 2: Pattern matching: use the point words (mainly the function words of the hyponymy lexico-syntactic patterns, introduced in the next sub-section) to match the predefined patterns, and keep the sentences that match the patterns;

Step 3: Syntactic parsing: perform word segmentation and syntactic parsing with the language processing tool LTP¹;

Step 4: Noun-pair extracting: extract noun pairs T(N1, N2) within each sentence from left to right. For example, from the sentence “青海湖/n 是/v 中国/n 最/d 大/a 的/u 咸水湖/n” (Qinghai Lake is the largest saltwater lake in China), we extract the noun pairs T(青海湖/Qinghai lake, 中国/China), T(青海湖/Qinghai lake, 咸水湖/Saltwater lake), and T(中国/China, 咸水湖/Saltwater lake);

Step 5: Class-label annotating: annotate each ordered noun pair T(N1, N2) with one of the four class labels, and annotate the features of the noun pair based on the patterns and the sentence in which it occurs.
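Step 4 can be illustrated with a small sketch. It assumes the parser output has already been flattened into a "word/tag" string like the example above and that every noun carries the tag "n"; the helper names `parse_tagged` and `extract_noun_pairs` are hypothetical and are not part of LTP.

```python
from itertools import combinations


def parse_tagged(sentence):
    """Split a 'word/tag word/tag ...' string into (word, tag) tuples."""
    tokens = []
    for item in sentence.split():
        word, _, tag = item.rpartition("/")
        tokens.append((word, tag))
    return tokens


def extract_noun_pairs(tokens):
    """Return ordered noun pairs T(N1, N2), with N1 to the left of N2.

    Assumption: only the tag "n" marks a noun; a real tag set may have
    further noun tags that would need to be included.
    """
    nouns = [word for word, tag in tokens if tag == "n"]
    # combinations() keeps the left-to-right order of the input list.
    return list(combinations(nouns, 2))


tagged = "青海湖/n 是/v 中国/n 最/d 大/a 的/u 咸水湖/n"
pairs = extract_noun_pairs(parse_tagged(tagged))
for n1, n2 in pairs:
    print(f"T({n1}, {n2})")
```

For the example sentence this yields the three pairs listed in Step 4; each pair would then be labeled in Step 5.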

¹ LTP, Language Technology Platform, developed by the IRLab of Harbin Institute of Technology.


3.2 Hyponymy Patterns for Chinese

We manually collected lexico-syntactic patterns for hyponymy acquisition in Chinese. Table 1 gives some examples of the hand-crafted patterns. Since Chinese has no boundaries between words, word segmentation is a prerequisite for Chinese language processing. Some point words are ambiguous as substrings: for instance, “等” (and others) and “等待” (wait), or “如” (such as) and “如果” (if), have totally different meanings. To remove such ambiguities, we add part-of-speech tags to the patterns, for example “等/u” (and others/auxiliary), “和/c” (and/conjunction) and so on.
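The effect of adding part-of-speech tags to a point word can be sketched with a regular expression over a "word/tag" string. This is only an illustration under assumed inputs: the two example sentences, their segmentation and the tags "r", "v" and "wp" are hypothetical, and the actual patterns of Table 1 are not reproduced here.

```python
import re

# Match the point word 等 only when it is a whole token tagged as auxiliary (u).
POINT_WORD = re.compile(r"(?:^|\s)等/u(?=\s|$)")


def has_point_word(tagged_sentence):
    """True if the tagged sentence contains the token 等/u."""
    return POINT_WORD.search(tagged_sentence) is not None


# Hypothetical tagged sentences (segmentation and tags are assumptions).
listing = "苹果/n 、/wp 香蕉/n 等/u 水果/n"   # "fruits such as apple, banana, etc."
waiting = "我们/r 等待/v 结果/n"              # "we wait for the result"

print(has_point_word(listing))   # the auxiliary 等 is present as a token
print(has_point_word(waiting))   # 等待 is a different word, so no match
print("等" in waiting)            # a raw substring test would match wrongly
```

Matching on the token plus its tag rejects “等待”, whereas a plain substring search for “等” would accept it.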

Table 1. The hand-crafted hyponymy patterns for Chinese

3.3 Feature Definition Based on Lexico-syntactic Patterns

Among the SVM-based approaches to hyponymy acquisition [4][6], Snow et al. used “whether a pattern matches or not” and “relevance results based on statistics of the number of matches” as classification features. To acquire hyponymy from free text, Snow et al. proposed expressing the patterns as syntactic dependency paths. Because the patterns were expressed by adding “satellite links” only to the shortest path between the noun pair within a sentence, some hyponymy patterns may lose part of the information in the lexico-syntactic pattern. For example, the pattern “NPX is a NPY” leaves out the word “a” in the representation “N: S: VBE, be, be,-VBE: PRED: N” in [4]. To avoid the decrease in accuracy caused by such information loss, we build patterns from the point words, that is, the function words that appear in the lexico-syntactic patterns shown in Table 2. The point words include verbs such as “是|有|如” (be|including|such as), degree adverbs such as “最|较” (the best|the better), auxiliaries such as “等|等等” (and others), etc. In addition, we classify the patterns into two groups according to the ordering of NPX and NPY.

An ordered noun pair T(N1, N2) in a sentence should receive a different hyponymy class (hyponym/hypernym or hypernym/hyponym) depending on the hyponymy pattern it matches. In this paper, we propose the following. If the pair T(N1, N2) matches a pattern of Type 1, then T(N1, N2) corresponds to T(NX, NY) in the pattern and the relation of N1 and N2 is hypernym/hyponym; an example is the noun pair T(青海湖/Qinghai lake, 咸水湖/Saltwater lake), which matches pattern P3 in the sentence “青海湖是中国最大的咸水湖” (Qinghai Lake is the largest saltwater lake in China). If the pair T(N1, N2) matches a pattern of Type 2, then T(N1, N2) corresponds to T(NY, NX) in the pattern and the relation of N1 and N2 is hyponym/hypernym; an example is the noun pair T(咸水湖/Saltwater lake, 青海湖/Qinghai lake), which matches pattern P4 in the sentence “中国最大的咸水湖是青海湖” (The largest saltwater lake in China is Qinghai Lake).

Table 2. The disassembled hyponymy patterns

We define features for the two ordered nouns N1 and N2 in a sentence based on the different hyponymy patterns. The definitions of the primary features are shown in Table 3. The corresponding feature value is set when the ordered nouns N1 and N2, together with the sentence in which they occur, match a pattern. The remaining feature values are set to their default value “0”. We use dependency paths to express the patterns (from P2 to P6) for the features F3 and F4. Figure 1 shows a dependency tree for the example sentence “青海湖是中国最大的咸水湖” (parsed by LTP; SBV is the subject-verb dependency path and VOB is the verb-object dependency path; the nouns are “青海湖” (Qinghai lake), “中国” (China) and “咸水湖” (Saltwater lake)), together with the labeled values of features F4 and F5 for each noun pair within this sentence.

At the same time, we define the feature F4 for patterns that include “<NPY><有/verb><NPZ>, and <NPX>:COO<NPZ>” in order to link the hyponymy and coordinate relations within a sentence. That is to say, the value of F4 is determined by the dependency path (SBV or VOB) between NPY and NPZ based on pattern P5, provided that there is a coordinate (COO) dependency path between NPZ and NPX (NPZ is any noun that matches the pattern within the sentence). As a result, the classifier can determine a set of hyponymy pairs for a sentence that contains both hyponymy and coordinate relations.
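The idea of linking hyponymy and coordination can be made concrete with a rule-based sketch. Note that in the method described here this link is expressed as a feature value that the SVM classifier uses, not as a hard rule; the function below, its name `expand_with_coordinates`, and the example words are hypothetical and only show which pairs such a link makes reachable.

```python
def expand_with_coordinates(hypernym, direct_hyponym, coordinate_pairs):
    """Return (hyponym, hypernym) pairs reachable through coordination.

    hypernym         -- NPY in the pattern <NPY><有><NPZ>
    direct_hyponym   -- NPZ, the noun matched directly by the pattern
    coordinate_pairs -- (noun, noun) pairs joined by a COO dependency
    """
    hyponyms = [direct_hyponym]
    for left, right in coordinate_pairs:
        # Any noun coordinated with NPZ plays the role of NPX.
        if left == direct_hyponym and right not in hyponyms:
            hyponyms.append(right)
        elif right == direct_hyponym and left not in hyponyms:
            hyponyms.append(left)
    return [(hyponym, hypernym) for hyponym in hyponyms]


# Hypothetical example: "fruit" has "apple"; "banana" and "mango" are
# coordinated with "apple" in the same sentence.
result = expand_with_coordinates(
    "水果", "苹果", [("苹果", "香蕉"), ("芒果", "苹果")]
)
print(result)
```

One pattern match on (NPY, NPZ) thus extends to every noun coordinated with NPZ, which is how a set of hyponymy pairs can be obtained from a single sentence.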

Figure 1. Dependency tree example by LTP, and part of the features for the noun pairs

Table 3. The definition of the primary features based on hyponymy patterns

In addition to the features based on the hand-crafted hyponymy patterns, the other features include the dependency path between the two nouns, the word that the two nouns depend on in common, the element between the two nouns (if there is exactly one), the elements immediately before and after each noun, and so on. We define the coordinate relation based on the coordinate patterns [11]: nouns linked by a conjunction “|和|及|以及” (and), or separated in a list by the pause mark “、”² in Chinese. Moreover, the coordinate relation is also identified through the dependency path “COO” (the coordinate dependency path in the syntactic analysis of LTP).
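The surface form of the coordinate patterns can be sketched as follows. The sketch assumes tokens are already given as (word, tag) tuples with "n" for nouns and "c" for conjunctions; the tag "wp" for punctuation and the example sentence are assumptions, and only directly adjacent nouns are paired, which is simpler than the COO dependency path produced by a parser.

```python
CONJUNCTIONS = {"和", "及", "以及"}   # conjunctions meaning "and"
PAUSE_MARK = "、"                     # separates items of a list


def coordinate_pairs(tokens):
    """Return (left, right) noun pairs joined by a conjunction or pause mark."""
    pairs = []
    for i in range(1, len(tokens) - 1):
        word, tag = tokens[i]
        is_link = word == PAUSE_MARK or (word in CONJUNCTIONS and tag == "c")
        left, left_tag = tokens[i - 1]
        right, right_tag = tokens[i + 1]
        if is_link and left_tag == "n" and right_tag == "n":
            pairs.append((left, right))
    return pairs


# Hypothetical tagged input: "apple, banana and mango"
tokens = [("苹果", "n"), ("、", "wp"), ("香蕉", "n"), ("和", "c"), ("芒果", "n")]
found = coordinate_pairs(tokens)
print(found)
```

Pairs that are coordinated only transitively (here 苹果 and 芒果) are not produced by this surface rule and would have to come from the dependency analysis.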

4. Experimental Setup and Results

Our approach is as follows: first, we train a multi-class classifier for hyponymy acquisition with SVM³ on the extracted features; second, the classifier determines whether each given noun pair is in a hypernym/hyponym relation, a hyponym/hypernym relation, a coordinate relation or others.
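Since the classifier is trained with the LIBSVM tools, each labeled noun pair has to be written in LIBSVM's sparse text format, `<label> <index>:<value> ...`, with ascending 1-based feature indices and zero-valued features omitted. The sketch below shows one way to serialize an example; the class numbers follow the four classes of Section 3, while the function name `to_libsvm_line` and the feature indices and values in the example are hypothetical.

```python
# Class numbers as listed in Section 3.
LABELS = {
    "hypernym/hyponym": 1,
    "hyponym/hypernym": 2,
    "coordinate": 3,
    "others": 4,
}


def to_libsvm_line(label, features):
    """Serialize one example as a line of LIBSVM's sparse text format.

    features maps a 1-based feature index to a numeric value. Features at
    their default value 0 are left out, as the sparse format allows.
    """
    parts = [str(LABELS[label])]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


# Hypothetical feature values for one noun pair.
line = to_libsvm_line("hyponym/hypernym", {1: 1, 3: 0, 4: 2})
print(line)
```

A file of such lines, one per ordered noun pair, is the kind of input that the LIBSVM training and prediction tools read.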

4.1 Experimental Setup

We first processed the free texts into a corpus in the way introduced in Section 3.1. We collected texts (394K) from the Baidu Encyclopedia⁴, most of which are descriptive documents. To avoid propagating errors caused by lexical or syntactic analysis, we manually deleted the wrongly analyzed sentences, and finally used 6,716 sentences as training data, from which we extracted 4,569 noun pairs (in their order of occurrence in the sentence). Then from the 54,303 noun pairs we selected 8,001 ordered noun pairs as test examples.

4.2 Experimental Evaluation and Discussion

Table 4 shows the experimental results. We obtained 53 hyponymy noun pairs of Type 1 hyponym/hypernym relation and 86 hyponymy noun pairs of Type 2 hypernym/hyponym relation from the test data. Although the method performs well in accuracy, the recall is relatively low. When the coordinate relations are excluded, the recall is only 29.5%, and the F score is 0.40. One possible reason for the low recall is that the manually built patterns are too limited to capture complicated language usage. Another possible reason is the flexibility of language, which makes it hard to acquire a complete set of hypernym/hyponym rules; consider, for example, the Chinese sentence “含有叶酸的水果:苹果、香蕉、芒果、木瓜、猕猴桃” (Fruits that contain folic acid: apple, banana, mango, papaya, kiwi fruit). In addition, the main factors that affect the extraction precision are as follows: first, there are some incorrect descriptions in free text; second, the point words in the patterns are polysemous, for example, the Chinese word “有” (including) may express not only the hyponymy relation but also the part-of relation; third, segmentation and syntactic parsing are not precise enough, etc.

² In Chinese, the punctuation mark “、” is always used to separate coordinating elements, and occurs between characters, words and phrases.

³ We use the LIBSVM tools, http://www.csie.ntu.tw/~cjlin/libsvm.

⁴ http://baike.baidu.com/

Table 4. The experimental evaluations
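The relation between the reported recall and F score can be checked with the usual balanced F measure, F = 2PR / (P + R). The sketch below assumes that the F score in the text is this balanced measure, which the text does not state explicitly; under that assumption, a recall of 0.295 and an F score of 0.40 imply a precision of roughly 0.6. Because both reported values are rounded, this is only an estimate and does not replace the figures of Table 4.

```python
def f_score(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f, recall):
    """Solve F = 2PR / (P + R) for P, given F and R."""
    denominator = 2 * recall - f
    if denominator <= 0:
        raise ValueError("no positive precision gives this F for this recall")
    return f * recall / denominator


# Values reported in the text for the setting without coordinate relations.
recall = 0.295
f = 0.40
precision = implied_precision(f, recall)
print(round(precision, 2))
```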

So as not to lose pattern information, we defined different features based on the patterns introduced in Section 3.3; for example, in Table 3 the feature F5 is defined from pattern P2 and takes into account the point word “a” and others, so that the accuracy of the pattern is not lost. When the features F5-F7 were deleted for a new test, the result decreased by 0.52 in F score. This result shows that more accurate patterns yield better performance.

Table 5. Examples of acquired hyponymy

From Table 5 we see that the classifier successfully divides hyponymy into the two fine-grained classes hyponym/hypernym and hypernym/hyponym. For a hyponymy noun pair, such classification distinguishes clearly between the hyponym and the hypernym. Moreover, we can acquire a set of hyponymy pairs for a sentence that matches patterns such as P5 and P6. For example, a hyponymy set was acquired for the third sentence shown in Table 5. We compared the acquired results with and without the feature label for the pattern that links the hyponymy and coordinate relations. The experimental results indicate that the hyponymy set cannot be acquired directly for a sentence without the label. Therefore, the use of coordinate relations extends the hyponymy

5. Conclusions and Future Works

In this paper, we discussed a novel way to extract hyponymy from Chinese free text with SVM, in which lexical patterns are used as classification features. By dividing the lexical patterns into fine-grained categories, we can extract ordered hyponymy effectively. The experimental results show that the limited set of patterns misses some hyponymy. Additionally, the uncertainty of the patterns degrades the system accuracy.

In future work, we will build more robust patterns to capture more language instances. Another interesting direction is to build a fast, effective hyponymy acquisition system for practical use.

Acknowledgements

This research has been partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B), 19300029, and Challenging Exploratory Research, 21650030. Many thanks to all our colleagues participating in this project. We also thank Dr. Motoyuki Suzuki, Dr. Kazuyuki Matumoto and Dr. Caixia Yuan for useful discussion of this work.

References

[1] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to WordNet: An on-line lexical database”, International Journal of Lexicography, Vol. 3, No. 4, pp. 235-244, January 1990.

[2] M. Kavalec, E. Maedche, and V. Svatek, “Discovery of Lexical Entries for Non-taxonomic Relations in Ontology Learning”, In Proceedings of SOFSEM 2004, Springer Berlin / Heidelberg, LNCS 2932, pp.249-256, 2004.

[3] Ying Yu, Xiao-long Wang and Liu Bing-qua, “A Method of Automatic Recognition for Chinese Organization Name Based on SVM/RS”, Journal of Electronics and Information Technology, Vol.28, No.5, May 2006.

[4] R. Snow, D. Jurafsky and A. Y. Ng., “Learning syntactic patterns for automatic hypernym discovery”, In: Advances in Neural Information Processing Systems 17 (NIPS), Vancouver, British Columbia, pp.1297-1304, December 2005.


[5] P. Pantel and D. Ravichandran, “Automatically Labeling Semantic Classes”, In Proceedings of Human Language Technology/North American Association for Computational Linguistics (HLT/NAACL-04), pp. 321-328. Boston, MA, 2004.

[6] A. Sumida, N. Yoshinaga and K. Torisawa, “Boosting Precision and Recall of Hyponymy Relation Acquisition from Hierarchical Layouts in Wikipedia”, In Proceedings of the Sixth Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, pp.2462-2469, May 2008.

[7] M. A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, In: Proc. of COLING 1992, pp.23-28, 1992.

[8] Yongwei Hu and Zhifang Sui, “Extracting Hyponymy Relation between Chinese Terms”, Lecture Notes in Computer Science, Volume 4993/2008, pp.567-572, May 2008.

[9] S. Keiji and T. Kentaro, “Automatic acquisition of hyponymy relations from HTML documents”, Journal of natural language processing, Information Processing Society of Japan (IPSJ), 12(1), pp.125-150, January 2005.

[10] Cungen Cao, Haitao Wang, and Wei Chen, “A Method of Hyponym Acquisition Based on “isa” Pattern”, Computer Science(Chinese), Vol.33, No.9, 2006.

[11] S. Cederberg and D. Widdows, “Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction”, Proc. of CoNLL-2003, pp.111-118, 2003.

[12] M. Ando, S. Sekine and S. Ishizaki, “Automatic Extraction of Hyponyms from Newspaper Using Lexicosyntactic Pattern”, IPSJ SIG (Information Processing Society of Japan) Technical Report, pp.77-82, 2003.

[13] H. Nakawatase and A. Aizawa, “Discovering IS_A Relationships from Text: A Method Based on Dependencies between Nouns and Verbs”, Transactions of the Japanese Society for Artificial Intelligence (TJSAI), Vol.22, No.6, pp.585-594, 2007.

[14] R. Girju, A. Badulescu and D. Moldovan, “Learning Semantic Constraints for The Automatic Discovery of Part-Whole Relations”, Proc. of HLT-2003, 2003.

[15] M. Ciaramita, T. Hofmann and M. Johnson, “Hierarchical Semantic Classification: Word Sense Disambiguation with World Knowledge”, Proc. of IJCAI-2003, 2003.