SSIMIP-2002 (July 11) Chapter 13 Chinese Information Extraction Technologies Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: [email protected] 臺灣大學資訊工程學研究所自然語言處理實驗室 Natural Language Processing Lab. National Taiwan University




Outline

• Introduction to Information Extraction (IE)

• Chinese IE Technologies

• Tagging Environment for Chinese IE

• Applications

• Summary




Introduction

• Information Extraction – the extraction or pulling out of pertinent information from large volumes of text

• Information Extraction System – an automated system to extract pertinent information from large volumes of text

• Information Extraction Technologies – techniques used to automatically extract specified information from text

(http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)


An Example in Air Vehicle Launch

• Original Document

• Named-Entity-Tagged Document

• Equivalence Classes

• Co-Reference Tagged Document


<DOC><DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> 中國時報 </DOCSRC><TEXT>【本報綜合紐約、華盛頓十一日外電報導】在華盛頓宣布首度出售「刺針」肩射防空飛彈 給南韓的第二天,美國與北韓今天在紐約恢復延擱已久的會談,這項預定三天的會談將以北韓的飛彈發展為重點,包括北韓準備部署射程 可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈 的報導。

 美國國務院發言人柏恩斯說:「在有關北韓 飛彈擴散問題上,美方的確有多項關切之處。」美國官員也長期懷疑北韓正對伊朗和敘利亞輸出飛彈,並希望平壤加入禁止擴散此種武器的

red: location name; blue: date expression; green: organization name; purple: person name


國際公約。美國官員已知會北韓說,倘若北韓希望與美國建立正常的外交關係,就必須減少飛彈輸出。

 這項有關北韓飛彈計劃的會談是雙方於一九九六年四月在德國柏林舉行的首度會談的後續談判。美國在該次會談中要求北韓停止生產、測試及出售飛彈給他國,尤其是敘利亞和伊朗兩國。

 美國副助理國務卿艾恩宏和北韓外交部對外事務局局長李衡哲分別為雙方的談判代表,會談預定在十三日結束。

 柏恩斯說:「美方非常關心所有北韓本身,或是北韓與中共、伊朗或其他國家的飛彈問題。我們認為就此與他們舉行會談是甚為重要。」

 而為提昇南韓陸軍的自衛能力,美國於昨天宣布準備出售價值三億零七百萬美元的一千零六十五枚刺針飛彈與其他武器給南韓,它說,這項交易不會使朝鮮半島的緊張局勢惡化。


五角大廈說:「這項設備與支援的銷售不會影響該區基本軍事均勢。」

 國務院也表示全力支持此項包含兩百一十三座發射台、支援設備、零件與訓練的交易。

 柏恩斯說:「這項交易獲得政府內每一個人的全力支持,它符合我們在朝鮮半島的政策。」他強調:「我們的第一優先是防衛南韓。」

 如果國會同意,這將是華府對南韓出售防空 飛彈的第一筆交易。

</TEXT></DOC>


<DOC><DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> 中國時報 </DOCSRC><ISRELEVANT> NO </ISRELEVANT><TITLE> <ENAMEX TYPE="LOCATION"> 美 </ENAMEX>擬售 <ENAMEX TYPE="LOCATION"> 南韓 </ENAMEX>1065 枚刺針飛彈 </TITLE><TEXT>

【本報綜合 <ENAMEX TYPE="LOCATION"> 紐約 </ENAMEX> 、 <ENAMEX TYPE="LOCATION"> 華盛頓 </ENAMEX><TIMEX TYPE="DATE"> 十一日 </TIMEX> 外電報導】在 <ENAMEX TYPE="LOCATION"> 華盛頓 </ENAMEX> 宣布首度出售「刺針」肩射防空飛彈 給 <ENAMEX TYPE="LOCATION"> 南韓 </ENAMEX> 的 <TIMEX TYPE="DATE"> 第二天 </TIMEX> , <ENAMEX TYPE="LOCATION"> 美國 </ENAMEX> 與 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX><TIMEX TYPE="DATE"> 今天 </TIMEX> 在<ENAMEX TYPE="LOCATION"> 紐約 </ENAMEX> 恢復延擱已久的會談,這項預定三天的會談將以 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX> 的飛彈發展為重點,包括 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX> 準備部署射程 可涵蓋幾乎 <ENAMEX TYPE="LOCATION"> 日本 </ENAMEX> 全境的「蘆洞」一號長程飛彈 的報導。


<ID="3"> 十一日 <ID="4" REF="3"> 今天 <ID="5" REF="3"> 出售「刺針」肩射防空飛彈 給南韓的第二天

<ID="63"> 延擱已久的會談 <ID="66" REF="63"> 一九九六年四月在德國柏林舉行的首度會談的後續談判 <ID="65" REF="63"> 這項有關北韓飛彈計劃的會談 <ID="70" REF="65"> 會談 <ID="69" REF="65"> 會談 <ID="64" REF="63"> 這項預定三天的會談


<DOC> <DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> <COREF ID="1"> 中國時報 </COREF> </DOCSRC><ISRELEVANT> NO </ISRELEVANT><TITLE> <COREF ID="6"> 美 </COREF>擬售 <COREF ID="23"> 南韓 </COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN=" 刺針飛彈 ">1065 枚刺針飛彈 </COREF> </TITLE><TEXT>

【 <COREF ID="2" REF="1" TYPE="IDENT"> 本報 </COREF> 綜合 <COREF ID="61"> 紐約 </COREF> 、 <COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT"> 華盛頓 </COREF><COREF ID="3">十一日 </COREF> 外電報導】在 <COREF ID="7" REF="6" TYPE="IDENT"> 華盛頓 </COREF> 宣布首度 <COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN=" 第二天 "> 出售「刺針」肩射防空飛彈 給 <COREF ID="24" REF="23" TYPE="IDENT"> 南韓 </COREF> 的第二天 </COREF> , <COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT"> 美國 </COREF> 與 <COREF ID="29"> 北韓 </COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT"> 今天 </COREF> 在 <COREF ID="62" REF="61" TYPE="IDENT"> 紐約 </COREF> 恢復 <COREF ID="63" MIN=" 會談 "> 延擱已久的會談 </COREF> , <COREF ID="64" REF="63" TYPE="IDENT" MIN=" 會談 "> 這項預定三天的會談 </COREF> 將以 <COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN=" 飛彈 "><COREF ID="30" REF="29" TYPE="IDENT"> 北韓 </COREF> 的飛彈 </COREF> 發展為重點,包括 <COREF ID="31" REF="29" TYPE="IDENT"> 北韓 </COREF> 準備部署射程 可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈 的報導。


IE Evaluation in MUC-7 (1998)

• Named Entity Task [NE]: Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure

• Multi-lingual Entity Task [MET]: the NE task for Chinese and Japanese

• Co-reference Task [CO]: Capture information on co-referring expressions: all mentions of a given entity, including those tagged in the NE and TE tasks


IE Evaluation in MUC-7 (cont.)

• Template Element Task [TE]: Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text

• Template Relation Task [TR]: Extract relational information on employee_of, manufacture_of, and location_of relations

• Scenario Template Task [ST]: Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.


Chinese IE Technologies

• Segmentation

• Named Entity Extraction

• Part of Speech/Sense Tagging

• Full/Partial Parsing

• Co-Reference Resolution




Segmentation

• Problem
– A Chinese sentence is composed of characters without word boundaries
– 這名記者會說國語。
• 這 名 記者 會 說 國語。
• 這 名 記者會 說 國語。

• Word Definition
– A character string with an independent meaning and a specific syntactic function


Segmentation

• Standard
– China 【信息處理用現代漢語分詞規範】
• Implemented in 1988
• National standard in 1992 (GB/T13715-92)
– Taiwan 【資訊處理用中文分詞標準草案】
• Proposed by ROCLING in 1996
• National standard in 1999 (CNS14366)


Segmentation Strategies

• Dictionary is an important resource
– List "all" possible words
– Find the most "plausible" path from a word lattice
– 把他的確實行動作了分析
– 電子計算機是會計算題目的機器

電 子 計 算 機 是 會 計 算 題 目 的 機 器

把 他 的 確 實 行 動 作 了 分 析


Segmentation Strategies (Continued)

• Disambiguation: Select the best combination
– Rule-based
• Longest-word first: 台灣大學 是 有名 的 學府
• 長詞遮蔽短詞 (a long word masks shorter words): *這 名 記者會 說 國語。
• Delete the discontinuous fragments
• Other heuristic rules: 2-3 word preference, ...
• Parser
– Statistics-based
• Markov models, relaxation method, and so on
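The longest-word-first heuristic above can be sketched in a few lines. This is a minimal illustration over a toy lexicon (my own, not the lecture's data); it also reproduces the masking failure noted above, where 記者會 hides 記者 + 會:

```python
def segment_longest_match(sentence, lexicon):
    """Greedy longest-word-first segmentation; characters not covered by
    any lexicon word fall back to single-character tokens."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest possible word starting at position i first.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"台灣大學", "台灣", "大學", "有名", "學府", "記者會", "記者", "國語"}
print(segment_longest_match("台灣大學是有名的學府", lexicon))
# ['台灣大學', '是', '有名', '的', '學府']
print(segment_longest_match("這名記者會說國語", lexicon))
# ['這', '名', '記者會', '說', '國語'] -- the long word masks 記者 + 會
```

The second result is exactly the starred (incorrect) reading on the slide: a greedy rule cannot recover 記者 + 會 once 記者會 matches, which is why the statistics-based methods below are needed.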


Segmentation Strategies

• Dictionary Coverage
– A dictionary cannot cover "all" the words
– Solutions
• Morphological rules
• (Semi-)automatic construction of dictionaries: automatic terminology extraction
• Unknown word resolution


Morphological Rules

• numeral + classifier + classifier – 一個個, 一條條

• date + time–八十五年十月四日

• noun (or verb) prefix/suffix– 學生們

• special verbs–丟丟 看,吃吃 看,寫寫 看–高高興興,歡歡喜喜,漂漂亮亮,迷迷糊糊–打打球,跑跑步,寫寫字

• ...
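Several of the reduplication patterns above are easy to express as surface rules. A minimal sketch (my own illustration, not the lecture's implementation) using regular expressions with backreferences for three of them:

```python
import re

# Rule names and character classes here are illustrative only.
RULES = {
    "numeral+classifier+classifier": re.compile(r"[一二三四五六七八九十]([個條張枚])\1"),  # 一個個
    "AABB":                          re.compile(r"(.)\1(.)\2"),                            # 高高興興
    "VV+看":                         re.compile(r"(.)\1看"),                               # 吃吃看
}

def match_morph_rules(token):
    """Return the names of the reduplication rules the whole token matches."""
    return [name for name, pat in RULES.items() if pat.fullmatch(token)]

print(match_morph_rules("一個個"))    # ['numeral+classifier+classifier']
print(match_morph_rules("高高興興"))  # ['AABB']
print(match_morph_rules("吃吃看"))    # ['VV+看']
```

In a real segmenter such rules would run over unmatched character sequences after dictionary look-up, promoting rule matches to word candidates.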


Term Extraction: n-gram Approach

• Compute n-grams from a corpus
• Select candidate terms
– Successor variety
• the successor variety will sharply increase until a segment boundary is reached
• use i-grams and (i+1)-grams to select candidate terms of length i
– Mutual Information: MI(x, y) = log2 ( P(x, y) / ( P(x) P(y) ) )
– Significance Estimation Function: SE(c) = f_c / (f_a + f_b - f_c), where c = c1 c2 ... cn, a = c1 c2 ... cn-1, b = c2 c3 ... cn, and f_a, f_b, f_c are their frequencies
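Both the mutual-information and significance-estimation measures can be computed directly from raw character n-gram counts. A sketch with simple relative-frequency estimates (the tiny "corpus" is just the example sentence from the segmentation slides, so the numbers are illustrative only):

```python
from collections import Counter
from math import log2

def ngram_counts(corpus, n):
    """Count all character n-grams of length n in a list of strings."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def mutual_information(corpus, x, y):
    """MI(x, y) = log2 P(xy) / (P(x) P(y)) for adjacent characters x, y."""
    uni, big = ngram_counts(corpus, 1), ngram_counts(corpus, 2)
    p_xy = big[x + y] / sum(big.values())
    p_x = uni[x] / sum(uni.values())
    p_y = uni[y] / sum(uni.values())
    return log2(p_xy / (p_x * p_y))

def significance(corpus, c):
    """SE(c) = f_c / (f_a + f_b - f_c), where a and b are the two
    (n-1)-gram halves of the n-gram c."""
    f_c = ngram_counts(corpus, len(c))[c]
    sub = ngram_counts(corpus, len(c) - 1)
    return f_c / (sub[c[:-1]] + sub[c[1:]] - f_c)

corpus = ["電子計算機是會計算題目的機器"]
print(significance(corpus, "計算機"))  # 1 / (2 + 1 - 1) = 0.5
print(round(mutual_information(corpus, "計", "算"), 2))  # 2.91
```

A high SE value means the n-gram occurs almost every time its two halves do, which is the signature of a lexical unit; on a realistic corpus the counts would come from millions of characters rather than one sentence.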




Named Entity Extraction

• Five basic components in a document
– People, affairs, time, places, things
– Major unknown words

• Named Entities in MET-2
– Names: people, organizations, locations
– Numbers: monetary/percentage expressions
– Time: date/time expressions


Named People Extraction

• Chinese person names
– Chinese person names are composed of surnames and names.
– Most Chinese surnames are a single character; some rare ones are two characters.
– Most names are two characters; some rare ones are a single character (in Taiwan).
– The length of Chinese person names ranges from 2 to 6 characters.

• Transliterated person names
– Transliterated person names denote foreigners.
– The length of transliterated person names is not restricted to 2 to 6 characters.


Named People Extraction: Chinese Person Names

• Extraction Strategies
– Baseline models: name-formulation statistics
• Propose possible candidates.
– Context cues
• Add extra scores to the candidates.
• When a title appears before (after) a string, it is probably a person name.
• Person names usually appear at the head or the tail of a sentence.
• Persons may be accompanied by speech-act verbs like "發言", "說", "提出", etc.
– Cache: occurrences of named people
• A candidate appearing more than once has a high tendency to be a person name.


Structure of Chinese Personal Names

• Chinese surnames have the following three types
– Single character like '趙', '錢', '孫', '李'
– Two characters like '歐陽' and '上官'
– Two surnames together like '蔣宋'

• Most names have the following two types
– Single character
– Two characters


Training Data

• Name-formulation statistics are trained from a 1-million person name corpus from Taiwan.
• Each entry contains surname, name, and sex.
• There are 489,305 male names and 509,110 female names.
• In total, 598 surnames are retrieved from this 1-M corpus.
• The surnames of very low frequency, like "是", "那", etc., are removed to avoid false alarms.
• Only 541 surnames are left; these are used to trigger the person name extraction system.


Training Data

• The probability of a Chinese character being the first character (or the second character) of a name is computed for males and females separately.
• We compute the probabilities using training tables for female and male names, respectively.
• Either the male score or the female score may be greater than the thresholds.
• In some cases, the female score may be greater than the male score.
• Thresholds are defined so that 99% of the training data pass them.


Baseline Models: name-formulation statistics

• Model 1. Single-character surname, e.g., '趙', '錢', '孫', and '李'
– P(C1)*P(C2)*P(C3) using the training table for males > Threshold1 and P(C2)*P(C3) using the training table for males > Threshold2, or
– P(C1)*P(C2)*P(C3) using the training table for females > Threshold3 and P(C2)*P(C3) using the training table for females > Threshold4

• Model 2. Two-character surname, e.g., '歐陽' and '上官'
– P(C2)*P(C3) using the training table for males > Threshold2, or
– P(C2)*P(C3) using the training table for females > Threshold4

• Model 3. Two surnames together, like '蔣宋'
– P(C12)*P(C2)*P(C3) using the training table for females > Threshold3, P(C2)*P(C3) using the training table for females > Threshold4, and P(C12)*P(C2)*P(C3) using the training table for females > P(C12)*P(C2)*P(C3) using the training table for males
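Model 1 can be sketched as a threshold test over character probabilities. Every probability and threshold below is made up for illustration; the real system estimates them from the 1M-name corpus with separate male/female tables:

```python
# Hypothetical tables; the actual values come from the training corpus.
P_SURNAME = {"陳": 0.05, "李": 0.04, "趙": 0.01}
P_FIRST = {"水": 0.004, "家": 0.006, "確": 1e-6}  # P(c as first name character)
P_LAST  = {"扁": 0.002, "明": 0.008, "實": 1e-6}  # P(c as second name character)
THRESHOLD1 = 1e-8  # on P(C1)*P(C2)*P(C3)
THRESHOLD2 = 1e-7  # on P(C2)*P(C3)

def is_name_candidate(c1, c2, c3):
    """Model 1 sketch (single-character surname), one training table only;
    the full model repeats this test with the male and female tables."""
    if c1 not in P_SURNAME:          # surname characters trigger the system
        return False
    name_score = P_FIRST.get(c2, 0.0) * P_LAST.get(c3, 0.0)
    full_score = P_SURNAME[c1] * name_score
    return full_score > THRESHOLD1 and name_score > THRESHOLD2

print(is_name_candidate("陳", "水", "扁"))  # True
print(is_name_candidate("陳", "確", "實"))  # False: name characters too unlikely
```

The second example is the 的確實-style false alarm the context cues below are designed to catch: the surname test alone cannot tell 陳確實 (a sentence fragment) from a real name.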


Cues from the Character Level

• Gender
– A married woman may add her husband's surname before her own surname. That forms type 3 person names.
– Because a surname character may also be used as a name character, candidates with two surnames do not always belong to the type 3 person names.
– The gender information helps us disambiguate this type of person name.
– Some Chinese characters have a high score for males and some for females. The following shows some examples.
– Male: 豪、霸、宏、志、斌、彬、強、正、昌、輝、雄
– Female: 佩、月、玉、如、君、秀、佳、怡、芬、芳、女


Cues from the Sentence Level

• Titles
– When a title appears before (after) a candidate, it is probably a person name. Titles also help to decide the boundary of a name.
– 總統陳水扁 vs. 總統向青年學子...

• Mutual Information
– Telling whether a string is a content word or a name is indispensable.
– 陳家世清白,決不會犯法。
– When there exists a strong relationship between the surrounding words, the candidate has a high probability of being a content word.

• Punctuation Marks
– When a candidate is located at the end of a sentence, we give it an extra score.
– Words around a caesura mark (、) tend to have similar types.


Cues from Passage/Document Level: Cache

• A person name may appear more than once in a paragraph.
• There are four cases when the cache is used.
– (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct.
– (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct.
– (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct.
– (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.


Cache

• The problem in using the cache is case selection.
• For every entry in the cache, we assign a weight.
– An entry with a clear right boundary has a high weight.
• title and punctuation
– The other entries are assigned a low weight.
• The use of weights in case selection
– high vs. high ==> case (2)
– high vs. low or low vs. high ==> the high-weight entry is correct
– low vs. low
• check the score of the last character of the name part
• 邱永漢 邱永強
• 李鵬常 李鵬及


Discussion

• Some typical types of errors
– Foreign names (e.g., 魏斯特, 艾琳達)
• They are identified as proper nouns correctly, but are assigned wrong features.
• About 20% of errors belong to this type.
– Rare surnames (e.g., 應, 伊, 鳳) or artists' stage names
• Nearly 14% of errors come from this type.
– Others
• Other proper nouns (place names, organization names, etc.)
• Identification errors


Omitted Name Problem

• Some texts omit the name part and leave only the surname.
– 陳踢了王一腳

• Strategies
– If the candidate appeared earlier in the same paragraph, it is an omitted name.
– If the candidate has a special title like "嫌、妻、老、女" or a general title like "立委、教授、 ...", then it is an omitted name.
– If two single characters have a very high probability of being surnames, and they appear around a caesura mark, then they are regarded as omitted names.


Transliterated Person Names

• Challenging Issues
– No special cue, like the surnames in Chinese person names, to trigger the recognition system.
– No restriction on the length of a transliterated person name.
– No large-scale transliterated person name corpus.
– Ambiguity in classification: '華盛頓' may denote a city or a former American president.


Strategy (1)

• Character Condition
– When a foreign name is transliterated, the selection of homophones is restricted. Richard Marx: 理查馬克斯 vs. 娌茶碼剋鷥
– The basic character set can be trained from a transliterated name corpus.
– If all the characters in a string belong to this set, the string is regarded as a candidate.


Strategy (2)

• Syllable Condition
– Some characters which meet the character condition do not look like transliterated names.
– Syllable Sequence
– Simplified Condition
• (1) For each candidate, check the syllable of the first (the last) character.
• (2) If the syllable does not occur in the training corpus, the character is deleted.
• (3) The remaining characters are treated in the same way.


Strategy (3)

• Frequency Condition
– For each candidate which has only two characters, compute the frequency of these two characters to see if it is larger than a threshold.
– The threshold is determined in a similar way as in the baseline model for Chinese person names.


Cues around Names

• Cues within Transliterated Names
– Character Condition
– Syllable Condition
– Frequency Condition

• Cues around Transliterated Names
– titles: the same as for Chinese person names
– name introducers: "叫", "叫作", "叫做", "名叫", and "尊稱"
– special verbs: the same as for Chinese person names
– pattern: first name ․ middle name ․ last name


Discussion

• Some transliterated person names may be identified by the Chinese person name extraction system.
– 魏斯特 愛琳達
• Some nouns may look like transliterated person names.
– popular brands of automobiles, e.g., '飛雅特' and '雪佛蘭'
– Chinese proper nouns, e.g., '利多', '連拉' and '華隆'
– Chinese person names, e.g., '朱士列'
• Besides the above nouns, boundary errors affect the precision too.
– (拉)瑞強森


Named Organization Extraction

• A complete organization name can be divided into two parts: name and keyword.
– Example: 台北市政府
– Many words can serve as names, but only some fixed words can serve as keywords.
– Challenging Issues
• (1) a keyword is usually a common content word
• (2) a keyword may appear in an abbreviated form
• (3) the keyword may be omitted completely


Classification of Organization Names

• Complete organization names
– This type of organization name is usually composed of proper nouns and keywords.
– Some organization names are very long, so (left) boundary determination is difficult.
– Some organization names with keywords are still ambiguous.
• '聯合報' usually denotes reading matter, not an organization.

• Incomplete organization names
– These organization names often omit their keywords.
– Abbreviated organization names may be ambiguous.
– '兄弟' and '公牛' are famous sport teams in Taiwan and in the USA, respectively; however, they are also common content words.


Strategies

• Keywords
– A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary.

• Prefix
– A prefix is a good marker for a possible left boundary.

• Single-character words
– If the character preceding a possible keyword is a single-character word, then the content word is not a keyword.
– If the characters preceding a possible keyword cannot exist independently, they form the name part of an organization.

• Words of at least two characters
– The words that compose a name part usually have strong relationships.


Strategies

• Parts of speech
– The name part of an organization cannot extend beyond a transitive verb.
– Numerals and classifiers are also helpful.

• Cache
– Problem: when should a pattern be put into the cache?
– The character set is incomplete.

• n-gram model
– The candidate must consist of a name and an organization name keyword.
– Its length must be greater than 2 words.
– It must not cross any punctuation marks.
– It must occur more often than a threshold.
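The four n-gram conditions translate directly into a candidate filter. A sketch (keyword set, punctuation class, and threshold are illustrative, not the system's actual values):

```python
import re

def ngram_org_candidate_ok(tokens, keywords, freq, threshold=2):
    """Check an n-gram candidate against the four conditions above.
    tokens: the candidate as a list of segmented words;
    keywords: set of organization-name keywords; freq: corpus frequency."""
    ends_in_keyword = tokens[-1] in keywords            # name + keyword
    long_enough = len(tokens) > 2                       # more than 2 words
    no_punct = not re.search(r"[,、。;:!?]", "".join(tokens))  # no punctuation inside
    frequent = freq > threshold                         # occurs often enough
    return ends_in_keyword and long_enough and no_punct and frequent

print(ngram_org_candidate_ok(["羅慧夫", "文教", "基金會"], {"基金會"}, freq=5))  # True
print(ngram_org_candidate_ok(["台北", "市政府"], {"市政府"}, freq=5))  # False: only 2 words
```

In the full system, candidates passing this filter are the ones the handcrafted rules on the next slide try to analyze into name and keyword parts.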


Handcrafted Rules

– OrganizationName → OrganizationName OrganizationNameKeyword
e.g., 聯合國 部隊
– OrganizationName → CountryName OrganizationNameKeyword
e.g., 美國 大使館
– OrganizationName → PersonName OrganizationNameKeyword
e.g., 羅慧夫 基金會
– OrganizationName → CountryName OrganizationName
e.g., 美國 國防部
– OrganizationName → LocationName OrganizationName
e.g., 伊利諾州 州府
– OrganizationName → CountryName {D|DD} OrganizationNameKeyword
e.g., 中國 國際 廣播電台
– OrganizationName → PersonName {D|DD} OrganizationNameKeyword
e.g., 羅慧夫 文教 基金會
– OrganizationName → LocationName {D|DD} OrganizationNameKeyword
e.g., 台北 國際 廣播電台


Discussion

• Most errors result from organization names without keywords.
– 金匯通 復華 大公投顧
– 兄弟 太陽 烈火

• Identification errors
– Even if keywords appear, organization names do not always exist.
• 上市公司 各國大學
– Erroneous left boundaries are also a problem.
• 不為國安局 (基督)長老教會

• Ambiguities
– 聯合報 天下雜誌


Application of Gender Assignment

• Anaphora resolution" 問華德教授,他說那是正常的師生戀,既然雙方都是獨身男女,總不會不准談戀愛吧。至於後來趙靜雯去了那裡,為甚麼失蹤,他一概不知,並輕描淡寫的說:「也許加拿大不適合她,跑回臺灣去了。」 "

– The gender of a person name is useful for this problem.
– The correct rate for gender assignment is 89%.

• Co-reference resolution


Named Location Extraction

• A location name is composed of name and keyword parts.
• Rules
– LocationName → PersonName LocationNameKeyword
– LocationName → LocationName LocationNameKeyword
• Locative verbs like '來自', '前往', and so on, are introduced to treat location names without keywords.
• Cache and n-gram models are also employed to extract location names.


Date Expressions

• DATE → NUMBER YEAR (三 年)
• DATE → NUMBER MTHUNIT (十 月)
• DATE → NUMBER DUNIT (五 日)
• DATE → REGINC (元旦)
• DATE → FSTATE DATE (今年 三月)
• DATE → COMMON DATE (前 兩年)
• DATE → REGINE DATE (民國 七十八年)
• DATE → DATE DMONTH (今年 三月)
• DATE → DATE BSTATE (去年 初)
• DATE → FSTATEDATE DATE (這年 三月底)
• DATE → FSTATEDATE DMONTH (今年 元月)
• DATE → FSTATEDATE FSTATEDATE (明年 今天)
• DATE → DATE YXY DATE (去年三月 至 今年五月)
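The non-recursive productions of this grammar can be approximated with regular expressions; a sketch (my own rendering, covering only a handful of the rules above, with an illustrative numeral class):

```python
import re

NUM = r"[〇零一二三四五六七八九十百千]+"  # Chinese numeral characters (illustrative)

# A few of the DATE productions rendered as regexes; the full grammar is
# recursive (DATE -> FSTATE DATE, DATE -> DATE YXY DATE, ...), which a flat
# regex list cannot capture.
DATE_PATTERNS = [
    re.compile(NUM + r"年"),                 # DATE -> NUMBER YEAR    e.g. 三年
    re.compile(NUM + r"月"),                 # DATE -> NUMBER MTHUNIT e.g. 十月
    re.compile(NUM + r"日"),                 # DATE -> NUMBER DUNIT   e.g. 五日
    re.compile(r"(今年|去年|明年|前年)"),     # some FSTATE dates
]

def find_dates(text):
    """Collect all spans matched by any of the date patterns."""
    spans = []
    for pat in DATE_PATTERNS:
        spans.extend(m.group() for m in pat.finditer(text))
    return spans

print(find_dates("民國七十八年三月五日"))  # ['七十八年', '三月', '五日']
```

A real implementation would merge adjacent matches bottom-up (七十八年 + 三月 + 五日 → one DATE) following the recursive productions, rather than report them separately.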


Time Expressions

• TIME → NUMBER HUNIT (五 時)
• TIME → NUMBER MUNIT (三十 分)
• TIME → NUMBER SUNIT (六 秒)
• TIME → FSTATETIME TIME
• TIME → FSTATE TIME
• TIME → TIME BSTATE
• TIME → MORN BSTATE
• TIME → TIME TIME
• TIME → TIME YXY TIME (今天 到 明天)
• TIME → NUMBER COLON NUMBER (03:45)


Monetary Expressions

• DMONEY → MOUNIT NUMBER MOUNIT (美金 五 元)
• DMONEY → NUMBER MOUNIT MOUNIT (五 元 美金)
• DMONEY → NUMBER MOUNIT (五 元)
• DMONEY → MOUNIT MOUNIT NUMBER (美金 $ 5)
• DMONEY → MOUNIT NUMBER ($ 5)
• DMONEY → NUMBER YXY DMONEY (三 至 五元)
• DMONEY → DMONEY YXY DMONEY (三元 至 五元)
• DMONEY → DMONEY YXY NUMBER ($200 - 500)


Percentage Expressions

• DPERCENT → PERCENT NUMBER (百分之 十)
• DPERCENT → NUMBER PERCENT (3%)
• DPERCENT → DPERCENT YXY DPERCENT (5% 到 8%)
• DPERCENT → DPERCENT YXY NUMBER (百分之八 到 十)
• DPERCENT → NUMBER YXY DPERCENT (八 到 十百分點)


Named Entity Extraction in MET-2

• Transform Chinese texts in GB codes into texts in Big-5 codes.
• Segment Chinese texts into a sequence of tokens.
• Identify named people.
• Identify named organizations.
• Identify named locations.
• Use the n-gram model to identify named organizations/locations.
• Identify the rest of the named expressions.
• Transform the results in Big-5 codes back into GB codes.


From GB Codes to Big-5 Codes

• The Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively.
• Our system is developed on the basis of Big-5 codes, so the transformation is required.
• Characters used both in the simplified character set and in the traditional character set always result in erroneous mappings:
– 旅遊 vs. 旅游, 報導 vs. 報道, 最後 vs. 最后, 那麼 vs. 那么, 準確 vs. 准確, 並不是 vs. 并不是, 幾十年 vs. 几十年, 好像 vs. 好象, 由於 vs. 由于, 長時間裡 vs. 長時間里, and so on.
• More unknown words may be generated.
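The conversion step can be sketched with Python's built-in gb2312 and big5 codecs (an assumption of this sketch; the original system predates them). Such a conversion is character-by-character, which is exactly the problem above: simplified 旅游 stays 旅游 rather than becoming the traditional lexical form 旅遊.

```python
def gb_bytes_to_big5_bytes(gb_bytes: bytes) -> bytes:
    """Character-level GB -> Big-5 conversion via Unicode.

    Characters with no Big-5 counterpart are replaced with '?', mirroring
    the unknown-word problem described on the slide.
    """
    text = gb_bytes.decode("gb2312")              # simplified-character text
    return text.encode("big5", errors="replace")  # unmappable -> '?'
```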


Hsin-Hsi Chen (NTU) 57

Segmentation

• We list all the possible words by dictionary lookup, and then resolve ambiguities by segmentation strategies.
• The test documents in MET-2 are selected from China newspapers.
• Our dictionary is trained from Taiwan corpora.
• Due to the different vocabulary sets, many more unknown words may be introduced: "人工智慧" vs. "人工智能", "軟體" vs. "軟件", "肯亞" vs. "肯尼亞", "紐西蘭" vs. "新西蘭", etc.
• The unknown words from different code sets and different vocabulary sets make named entity extraction more challenging.
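The dictionary-lookup step can be illustrated with forward maximum matching, a common baseline segmenter (a simplification: the system described here lists all candidate words and then applies separate disambiguation strategies):

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character (a potential unknown
    word) when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

With a Taiwan-trained dictionary, a mainland term like 人工智能 would fall apart into single characters here, which is how vocabulary mismatch creates unknown words.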


Hsin-Hsi Chen (NTU) 58

MET-2 Formal Run of NTUNLPL

• F-measures
– P&R: 79.61%
– 2P&R: 77.88%
– P&2R: 81.42%

• Recall and Precision
– name: (85%, 79%)
– number: (91%, 98%)
– time: (95%, 85%)


Hsin-Hsi Chen (NTU) 59

Named Persons

• The recall rate and the precision rate are 91% and 74%.
• Major errors
– Segmentation, e.g., 盛世良 -> 盛世 良: part of a person name may be regarded as a word during segmentation.
– Incomplete surname set, character set, and title set, e.g., 肖成林 , 卡拉 捷 耶夫 , 醫生 卡庫
– Blanks, e.g., 羅 俏: we cannot tell whether blanks exist in the original documents or are inserted by the segmentation system.
– Boundary errors
– Japanese names, e.g., 田中真紀子


Hsin-Hsi Chen (NTU) 60

Evaluation: Named Organizations

• The recall rate and the precision rate are 78% and 85%.
• Major errors
– More than two content words between the name and the keyword, e.g., 中國 衛星 發射 代理 公司
– Absence of keywords, e.g., 巴解法塔賀武裝
– Absence of the name part: the name part does not satisfy the character condition, e.g., 亞星公司
– N-gram errors, e.g., 安得拉邦東南部發射基地


Hsin-Hsi Chen (NTU) 61

Evaluation: Named Locations

• The recall rate and the precision rate are 78% and 69%.
• Character set
– The characters "鹿" and "島" in the string "鹿兒島縣" do not belong to our transliterated character set.
• Wrong keyword
– The character "部" is an organization keyword, so the string "菲律賓馬部" is mistakenly regarded as an organization name.
• Common content words
– Words such as "太陽", "土星", etc., are common content words; we do not give them special tags.
• Single-character locations
– Single-character locations such as "中", "日", and so on, are missed during recognition.


Hsin-Hsi Chen (NTU) 62

Evaluation: Time/Date Expressions

• The recall and precision rates for date, time, monetary, and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.

• Major errors
– Propagation errors
• segmentation before entity extraction, e.g., "迄今"
• named-people extraction before date expressions
– Absent date units
• the date unit does not appear, e.g., "一九九六"
• the date unit should appear, e.g., "九月十"


Hsin-Hsi Chen (NTU) 63

– Absent keywords
• Some keywords are not listed.
• E.g., "上午莫斯科時間 8 點 58 分" is divided into "上午" and "8 點 58 分".
– Rule coverage
• E.g., "今、明兩年"
– Ambiguity
• Some characters like "點" can be used in both time and monetary expressions. E.g., "十二點七七億美元" is divided into two parts: "十二點" and "七七億美元".
• The strings "十分" and "一時" are words, so in our pipelined model "九點十分" and "下午一時" will be missed.


Hsin-Hsi Chen (NTU) 64

Issues

• Deal with the errors propagated from the previous modules
– pipelining model vs. interleaving model

• Deal with the errors resulting from rule coverage
– handcrafted rules vs. learned rules

• Deal with the errors resulting from segmentation standards
– vocabulary set of Academia Sinica vs. Peking University


Hsin-Hsi Chen (NTU) 65

Pipelining Model

(Figure) input → segmentation (ambiguity resolution; only one result is passed on) → named entity extraction (named people, named locations, named organizations, numbers, date/time) → output.


Hsin-Hsi Chen (NTU) 66

Interleaving Model

(Figure) input → table lookup → {named people, named locations, named organizations, numbers, date/time} → ambiguity resolution → output.


Hsin-Hsi Chen (NTU) 67

An Example in Interleaving Model

王 國 偉 立 刻 作


Hsin-Hsi Chen (NTU) 68

Learning Rules vs. Hand-Crafted Rules

• Collect organization names.
• Extract patterns
– Cluster organization names based on keywords
– Assign features to name parts
– Employ the Teiresias algorithm to extract patterns (http://cbcsrv.watson.ibm.com/Tspd.html)


Hsin-Hsi Chen (NTU) 69

Teiresias Algorithm

• Patterns consist of words and a wild card (*), e.g.,

[ 台北 商業 協會 ]

[ 台北 餐飲 協會 ]

=> 台北 * 協會

• Parameter settings
– L: the least number of non-wild cards
– W: the maximum window length spanned by L non-wild cards
– T: confidence level, i.e., how many training instances a pattern must satisfy
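A toy rendering of the parameter semantics, assuming a simplified pattern language in which * matches exactly one token (real Teiresias wild cards are more general, and real pattern discovery enumerates candidates rather than just checking them):

```python
def matches(pattern, name):
    """pattern and name are token lists; '*' matches exactly one token
    (a simplification for illustration)."""
    return len(pattern) == len(name) and all(
        p == "*" or p == t for p, t in zip(pattern, name))

def is_valid(pattern, instances, L=2, W=3, T=2):
    """Check the <L, W> density constraint and the support threshold T:
    the pattern must have at least L literals, any L consecutive literals
    must span at most W positions, and at least T instances must match."""
    lits = [i for i, p in enumerate(pattern) if p != "*"]
    if len(lits) < L:
        return False
    if any(lits[i + L - 1] - lits[i] + 1 > W for i in range(len(lits) - L + 1)):
        return False
    return sum(matches(pattern, n) for n in instances) >= T

names = [["台北", "商業", "協會"], ["台北", "餐飲", "協會"]]
```

With the two training names above, the pattern 台北 * 協會 satisfies L=2, W=3, T=2, while 台北 * 學會 fails on support.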


Hsin-Hsi Chen (NTU) 70

Keyword Set

• Extracting the keyword set
– Input all the training instances (i.e., organization names) into the Teiresias algorithm.
– Let the confidence level be 5.
– Find all the patterns not ending with a wild card, e.g., ( * * 公司 ) ( * * 陣線 ) ( * 協會 )
– Regard the suffix of a pattern as a keyword, e.g., 基金會 聯誼會 處 會議 會社 司令部 總會 軍事委員會


Hsin-Hsi Chen (NTU) 71

Features of Patterns

• Types
– named entities
• named people
• named locations
• named organizations
• date expressions ( 三十一日國家地震救災委員會 )
• numbers ( 87水災臨時救災委員會 )
– common nouns


Hsin-Hsi Chen (NTU) 72

Tagging


Hsin-Hsi Chen (NTU) 73

Tagging

• Lexical level
– part-of-speech tagging
– named entity tagging
– sense tagging

• Syntactic level
– syntactic category (structure) tagging

• Discourse level
– anaphora-antecedent tagging
– co-reference tagging


Hsin-Hsi Chen (NTU) 74

Part-of-Speech Tagging

• Issues of tagging accuracy
– the amount of training data
– the granularity of the tag set
– the occurrences of unknown words, and so on

• Academia Sinica Balanced Corpus
– 5 million words
– 46 tags

• Language models, e.g., bigram, trigram, …
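A bigram language model for tagging is typically decoded with the Viterbi algorithm. The sketch below uses made-up transition and emission probabilities (not ASBC estimates) purely to illustrate the mechanics:

```python
import math

# Toy parameters: P(tag | previous tag) and P(word | tag). Invented values.
trans = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
         ("N", "V"): 0.7, ("N", "N"): 0.3,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "狗"): 0.5, ("V", "跑"): 0.5, ("N", "跑"): 0.01, ("V", "狗"): 0.01}

def viterbi(words, tags=("N", "V")):
    """Return the most probable tag sequence under the bigram model."""
    # best[t] = (log-prob of the best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(
            ((lp + math.log(trans[(prev, t)] * emit.get((t, w), 1e-9)), path + [t])
             for prev, (lp, path) in best.items()), key=lambda x: x[0])
            for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]
```

A trigram model is the same idea with states ranging over tag pairs rather than single tags.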


Hsin-Hsi Chen (NTU) 75

Sense Tagging

• Assign sense labels to words in a sentence.

• Sense Tagging Set
– tong2yi4ci2ci2lin2 (同義詞詞林 , Cilin)
– 12 large categories
– 94 middle categories
– 1,428 small categories
– 3,925 word clusters


Hsin-Hsi Chen (NTU) 76

A. People
Aa. collective names: 01 human being, the people, everybody; 02 I, we; 03 you; 04 he/she, they; 05 myself, others, someone; 06 who
Ab. people of all ages and both sexes: 01 a man, a woman, men and women; 02 an old person, an adult, the old and the young; 03 a teenager; 04 an infant, a child
Ac. posture: 01 a tall person, a dwarf; 02 a fat person, a thin person; 03 a beautiful woman, a handsome man


Hsin-Hsi Chen (NTU) 77

A. PERSON (人): Aa. general name (泛稱), Ab. people of all ages and both sexes (男女老少), Ac. posture (體態), Ad. nationality/citizenship (籍屬), Ae. occupation (職業), Af. identity (身分), Ag. situation (狀況), Ah. relative/family dependents (親人/眷屬), Ai. rank in the family (輩次), Aj. relationship (關係), Ak. morality (品行), Al. ability and insight (才識), Am. religion (信仰), An. comic/clown type (丑類)

B. THING (物): Ba. generally called (統稱), Bb. (擬狀物), Bc. part of an object (物體的部分), Bd. celestial body (天體), Be. terrain features (地貌), Bf. meteorological phenomena (氣象), Bg. natural substance (自然物), Bh. plant (植物), Bi. animal (動物), Bj. micro-organism (微生物), Bk. the whole body (全身), Bl. secretions/excretions (排泄物/分泌物), Bm. material (材料), Bn. building (建築物), Bo. machines and tools (機具), Bp. appliances (用品), Bq. clothing (衣物), Br. edibles/medicines/drugs (食品/藥物/毒品)

C. TIME AND SPACE (時間與空間): Ca. time (時間), Cb. space (空間)

D. ABSTRACT THINGS (抽象事物): Da. event/circumstances (事情/情況), Db. reason/logic (事理), Dc. looks (外貌), Dd. functions/properties (性能), De. character/ability (性格/才能), Df. consciousness (意識), Dg. analogical thing (比喻物), Dh. imaginary things (臆想物), Di. society/politics (社會/政法), Dj. economy (經濟), Dk. culture and education (文教), Dl. disease (疾病), Dm. organization (機構), Dn. quantity/unit (數量/單位)

E. CHARACTERISTICS (特徵): Ea. external form (外形), Eb. surface looks/seeming (表象), Ec. color/taste (顏色/味道), Ed. property (性質), Ee. virtue and ability (德才), Ef. circumstances (境況)

F. MOTION (動作): Fa. motion of the upper limbs (hands) (上肢動作), Fb. motion of the lower limbs (legs) (下肢動作), Fc. motion of the head (頭部動作), Fd. motion of the whole body (全身動作)

G. PSYCHOLOGICAL ACTIVITY (心理活動): Ga. state of mind (心理狀態), Gb. activity of mind (心理活動), Gc. capability and willingness (能/願)

H. ACTIVITY (活動): Ha. political activity (政治活動), Hb. military activity (軍事活動), Hc. administrative management (行政管理), Hd. production (生產), He. economic activity (經濟活動), Hf. communications and transportation (交通運輸), Hg. education, hygiene, and scientific research (教衛科研), Hh. recreational and sports activities (文體活動), Hi. social contact (社交), Hj. life (生活), Hk. religious activity (宗教活動), Hl. superstitious activity (迷信活動), Hm. public security and judicature (公安/司法), Hn. wicked behavior (惡行)

I. PHENOMENON AND CONDITION (現象與狀態): Ia. natural phenomena (自然現象), Ib. physiological phenomena (生理現象), Ic. facial expression (表情), Id. object status (物體狀態), Ie. situation (事態), If. circumstances (mostly unlucky) (境遇), Ig. the beginning and the end (始末), Ih. change (變化)

J. TO BE RELATED (關聯): Ja. association (聯繫), Jb. similarities and dissimilarities (異同), Jc. to operate in coordination (配合), Jd. existence (存在), Je. influence (影響)

K. AUXILIARY PHRASE (助語): Ka. quantitative modifier (疏狀), Kb. preposition (中介), Kc. conjunction (聯接), Kd. auxiliary (輔助), Ke. interjection (呼嘆), Kf. onomatopoeia (擬聲)

L. GREETINGS (敬語)


Hsin-Hsi Chen (NTU) 78

Degree of Polysemy in Mandarin Chinese

• Small categories of Cilin are used to compute the distribution of word senses.

• The ASBC corpus is employed to count word frequencies.

• In total, 28,321 word types appear both in Cilin and in the ASBC corpus.

• In total, 5,922 words are polysemous.


Hsin-Hsi Chen (NTU) 79

Table 1. The Distribution of Word Senses

Low Ambiguity          Middle Ambiguity     High Ambiguity
Degree  #Word Types    Degree  #Word Types  Degree  #Word Types   Degree  #Word Types
2       4261 (71.95%)  5       186 (3.14%)  9       14 (0.24%)    14      1 (0.02%)
3       948 (16.01%)   6       77 (1.30%)   10      8 (0.14%)     15      1 (0.02%)
4       344 (5.81%)    7       42 (0.71%)   11      3 (0.05%)     17      1 (0.02%)
                       8       25 (0.42%)   12      4 (0.07%)     18      1 (0.02%)
                                            13      5 (0.08%)     20      1 (0.02%)
Sum     5553 (93.77%)  Sum     330 (5.57%)  Sum     39 (0.66%)

Total word types: 5,922
degree: number of senses of a word; word type: a dictionary entry


Table 2. The Distribution of Word Senses with Consideration of POS

Degree   N              V              A             F            K
2        1441 (81.05%)  1056 (71.79%)  580 (79.67%)  14 (77.78%)  101 (73.72%)
3        238 (13.39%)   238 (16.18%)   115 (15.80%)  4 (22.22%)   25 (18.25%)
4        55 (3.09%)     99 (6.73%)     20 (2.75%)                 7 (5.11%)
5        26 (1.46%)     41 (2.79%)     9 (1.24%)                  3 (2.19%)
6        12 (0.67%)     13 (0.88%)     2 (0.27%)                  1 (0.73%)
7        3 (0.17%)      13 (0.88%)     2 (0.27%)
8        2 (0.11%)      6 (0.40%)
9        1 (0.06%)      1 (0.07%)
11                      1 (0.07%)
12                      1 (0.07%)
13                      1 (0.07%)
19                      1 (0.07%)
Total    1778           1471           728           18           137

Degrees 2-4 count as low, 5-8 as middle, and >8 as high ambiguity. Total word types: 4,132 of the 5,922 polysemous words. Low ambiguity covers 97.53% of N and 94.70% of V, and 98.22% of A and 97.08% of K.


Hsin-Hsi Chen (NTU) 81

Table 3. The Distribution of Word Senses with Consideration of Frequencies

                 Low Ambiguity       Middle Ambiguity   High Ambiguity
Types            5553 (93.77%)       330 (5.57%)        39 (0.66%)
Tokens           1143686 (58.52%)    635796 (32.53%)    174731 (8.94%)
#Tokens/#Types   205.96              1926.65            4480.28

word token: an occurrence of a word type in the ASBC corpus; word type: a dictionary entry

93.77% of polysemous words belong to the class of low ambiguity, yet they occupy only 58.52% of the tokens in the ASBC corpus.


Hsin-Hsi Chen (NTU) 82

Table 4. The Distribution of Word Senses and Frequencies with Consideration of POS

                             Frequency
Ambiguity                    Low       Middle    High      Sum       Percentage
Low (2-4)     Types (C)      3112      734       147       3993      96.64%
              Tokens (A)     70131     230955    735819    1036905   85.52%
              A/C            22.54     314.65    5005.57   259.68
Middle (5-8)  Types (C)      42        62        29        133       3.22%
              Tokens (A)     1905      14667     153307    169879    14.01%
              A/C            45.36     236.56    5286.45   1277.29
High (>8)     Types (C)      0         2         4         6         0.15%
              Tokens (A)     0         843       4847      5690      0.47%
              A/C            0         421.5     1211.75   948.33
Sum           Types (C)      3154      798       180       4132
              Tokens (A)     72036     246465    893973    1212474
              A/C            22.84     308.85    4966.52
%             Types (C)      76.33%    19.31%    4.36%
              Tokens (A)     5.94%     20.33%    73.73%

Low frequency (<100), middle frequency (100 to <1000), high frequency (>=1000).


Hsin-Hsi Chen (NTU) 83

Phenomena

• POS information reduces the degree of ambiguity
– In total, 8.94% of word tokens are highly ambiguous in Table 3; this decreases to 0.47% in Table 4.

• Highly ambiguous words tend to be highly frequent
– 23.67% of word types are middle- or high-frequency words, and they occupy 94.06% of word tokens.


Hsin-Hsi Chen (NTU) 84

Semantic Tagging: Unambiguous Words

• Acquire the context for each semantic tag, starting from the unambiguous words.

(Figure) Cilin + ASBC → unambiguous words, i.e., the words that have only one sense in Cilin.


Hsin-Hsi Chen (NTU) 85

Acquire Contextual Vectors

• An unambiguous word is characterized by the surrounding words.
• The window size is set to 6, and stop words are removed.
• A sense tag Ctag is represented as a vector (w1, w2, ..., wn).
– MI metric:

  MI(Ctag, cw) = log2 [ P(Ctag, cw) / (P(Ctag) P(cw)) ] = log2 [ f(Ctag, cw) × N / (f(Ctag) × f(cw)) ]

– EM metric:

  em(Ctag, cw) = max( [f(Ctag, cw) - En(Ctag, cw)] / sqrt( f(Ctag) × f(cw) ), 0 ),
  where En(Ctag, cw) = f(Ctag) × f(cw) / N is the expected co-occurrence count.

(f(·) denotes a frequency in the corpus, cw is a context word, and N is the corpus size.)
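As a small illustration of the MI metric above (a sketch with made-up counts; the trained system computes these from ASBC co-occurrence counts within the 6-word window):

```python
import math

def mi(f_tag, f_cw, f_joint, n):
    """MI(Ctag, cw) = log2( P(Ctag, cw) / (P(Ctag) P(cw)) )
                    = log2( f(Ctag, cw) * N / (f(Ctag) * f(cw)) ).

    f_tag, f_cw: marginal counts; f_joint: co-occurrence count; n: corpus size.
    """
    return math.log2(f_joint * n / (f_tag * f_cw))
```

A positive value means the tag and the context word co-occur more often than chance; the weights w1, ..., wn of a sense vector are these association scores.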


Hsin-Hsi Chen (NTU) 86

Semantic Tagging: Ambiguous Words

• Apply the information trained at the first stage to select the best sense tag from the candidates of each ambiguous word.

(Figure) Ambiguous words are those words that have more than one sense in Cilin.


Hsin-Hsi Chen (NTU) 87

Apply and Retrain Contextual Vectors

• Identify the context vector of an ambiguous word.

• Measure the similarity between a sense vector and a context vector by cosine formula.

• Select the sense tag of the highest similarity score.

• Retrain the sense vector for each sense tag.
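The select-by-cosine step can be sketched as follows, with sense and context vectors represented as sparse dicts (a representation assumed here for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_sense(context_vec, sense_vecs):
    """Pick the sense tag whose trained vector is closest to the context."""
    return max(sense_vecs, key=lambda tag: cosine(context_vec, sense_vecs[tag]))
```

Retraining then folds the newly tagged occurrences back into each sense vector and repeats.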


Hsin-Hsi Chen (NTU) 88

Semantic Tagging: Unknown Words

• Adopt outside evidence from the mapping between WordNet synsets and Cilin sense tags to narrow down the candidate set.


Hsin-Hsi Chen (NTU) 89

Candidate List

(Figure) An unknown Chinese word Cw is mapped by a Chinese-English dictionary (C-EDict) to English translations ew1, ew2, ew3; their WordNet synsets (syn11, syn12, syn21, syn22, syn23, syn31) are mapped through the mapping table between WordNet synsets and Cilin tags to candidate Cilin tags Ctag1, Ctag2, Ctag3; cosine similarity cos(vw, v) selects among the candidates.


Hsin-Hsi Chen (NTU) 90

Experiments

• Test materials
– Sample documents of different categories from the ASBC corpus
– In total, 35,921 words in the test corpus
– Research associates tagged this corpus manually:
• mark up the ambiguous words by looking up the Cilin dictionary
• tag the unknown words by looking up the mapping table

• The tag mapper achieves approximately 82.52% performance.


Hsin-Hsi Chen (NTU) 91

Performance of Tagging Ambiguous Words using MI

Ambiguity        Low      Middle   High     Summary
Total Tokens     6601     3511     989      11101
Correct Tokens   4132     1101     267      5500
Correct Rate     62.60%   31.36%   27.00%   49.55%


Hsin-Hsi Chen (NTU) 92

Performance of Tagging Ambiguous Words using EM

Ambiguity        Low      Middle   High     Summary
Total Tokens     6601     3511     989      11101
Correct Tokens   4223     1334     310      5867
Correct Rate     63.98%   37.99%   31.34%   52.85%

(For comparison, MI: 62.60%, 31.36%, 27.00%, 49.55%.)


Hsin-Hsi Chen (NTU) 93

Table 8. Performance of Tagging Using First-n and Middle Categories

                     Ambiguity
First-n  Categories  Low      Middle   High     Middle and High
1        Small       63.98%   37.99%   31.34%   36.53%
1        Middle      71.02%   56.19%   43.78%   53.47%
2        Small                60.92%   53.99%   59.40%
2        Middle               73.88%   65.72%   72.09%
3        Small                71.35%   67.95%   70.60%
3        Middle               79.27%   75.94%   78.53%

● The performance for tagging low-ambiguity (2-4), middle-ambiguity (5-8), and high-ambiguity (>8) words is similar (63.98%, 60.92%, and 67.95%) when 1, 2, and 3 candidates are proposed, respectively.
● Under the middle categories with 1-3 proposed candidates, the performance for tagging low-, middle-, and high-ambiguity words is 71.02%, 73.88%, and 75.94%.


Table 9. Performance of Tagging Unknown Words

Category  #Tokens  Measure    Baseline  M1      M2      P1      P2      M1(POS)
All       1633     Correct    20        443     395     438     396     561
                   Precision  1.22%     27.13%  24.19%  26.82%  24.25%  34.35%
N         858      Correct    11        255     228     255     231     320
                   Precision  1.28%     29.72%  26.57%  29.72%  26.92%  37.30%
V         619      Correct    5         144     124     137     120     167
                   Precision  0.81%     23.26%  20.03%  22.13%  19.39%  26.98%
A         58       Correct    0         5       5       5       5       28
                   Precision  0         8.62%   8.62%   8.62%   8.62%   48.28%
F         4        Correct    1         1       1       1       1       4
                   Precision  25.00%    25.00%  25.00%  25.00%  25.00%  100.00%
K         94       Correct    3         38      37      40      39      42
                   Precision  3.19%     40.43%  39.36%  42.55%  41.49%  44.68%

M1, M2: training from unambiguous words; P1, P2: training from unambiguous and ambiguous words. M1 and P1 use the more restrictive mapping table; M2 and P2 use the less restrictive mapping table.


Hsin-Hsi Chen (NTU) 95

Co-Reference Resolution


Hsin-Hsi Chen (NTU) 96

Introduction

• Anaphora vs. Co-Reference
– Anaphora (Example 1): 張三是老師，他1教學很認真，同時，他2也是一個好爸爸。
– Co-reference (同指涉) includes
• Type/Instance: "老師"/"張三", "一個好爸爸"/"張三"
• Function/Value: "現在的氣溫"/"攝氏 30 度" (Example 2: 現在的氣溫是攝氏 30 度。)
• Co-reference between NPs: "一隻小花貓"/"那隻貓"


Hsin-Hsi Chen (NTU) 97

Flow of Co-Reference Resolution

(Figure) Document → segmentation → named entity extraction → part-of-speech tagging → find all possible candidates → find attributes of candidates → resolve co-reference → output. Resources: AS Balanced Corpus, finite state transition networks, NP chunker, Cilin.


Hsin-Hsi Chen (NTU) 98

Co-Reference Resolution Algorithm

(Figure) Document → find the candidate list → determine candidates → all the candidates are partitioned into co-reference classes (Class 1, Class 2, ..., Class N) and singletons.


Hsin-Hsi Chen (NTU) 99

Find the Candidates

• Select all nouns (Cand-Terms)
– Na (common nouns)
– Nb (proper nouns)
– Nc (place nouns)
– Nd (time nouns)
– Nh (pronouns)
– Delete some Nds (171 in total), e.g., "一會兒", "稍後", "剎那間", "瞬間", "幾時", ...

• Select noun phrases (Cand-NP)

• Select maximal noun phrases (Cand-MaxNP)

• Some are found during named entity extraction.


Hsin-Hsi Chen (NTU) 100

Recognize NPs whose head is Na (common noun)

(Figure: a finite state transition network over the categories Init, adj, Neu, Neqa, Nes, Nep, Ncd, Nf, DE, Nh, Na.) Examples:

其他 (Neqa) 的 (DE) 廠商 (Na)
其他 (Neqa) 三 (Neu) 家 (Nf) 廠商 (Na)
這 (Nep) 三 (Neu) 家 (Nf) 廠商 (Na)
該 (Nes) 廠商 (Na)


Hsin-Hsi Chen (NTU) 101

Recognize NPs whose head is Nh (pronoun)

(Figure: a finite state transition network over Init, Na, Nb, Nh.) Examples:

塞南 (Nb) 本人 (Nh)
太空船 (Na) 本身 (Nd)
他 (Nh) 本人 (Nh)


Hsin-Hsi Chen (NTU) 102

Cilin (同義詞詞林 )

• 12 large categories

• 94 middle categories

• 1,428 small categories

• 3,925 word clusters


Hsin-Hsi Chen (NTU) 103

Features

• Classification
– word/phrase itself
– part of speech of the head
– semantics of the head
– type of named entity
– positions (sentences and paragraphs)
– number: singular, plural, unknown
– gender: pronouns and Chinese person names
– pronouns: personal pronouns, demonstrative pronouns


Hsin-Hsi Chen (NTU) 104

Co-Reference Resolution Algorithms

• Strategy 1: simple pattern matching

• Strategy 2: Cardie Clustering Algorithm


Hsin-Hsi Chen (NTU) 105

Cardie Clustering Algorithm

Algorithm: Coreference_Clustering(NPn, NPn-1, ..., NP1)
1. Let R be the clustering radius.
2. Initially, each candidate forms its own cluster: ci = {NPi}.
3. for j = n downto 1
     for i = j-1 downto 1
       (a) let d = dist(NPi, NPj)
       (b) let ci (cj) be the class to which NPi (NPj) belongs
       (c) if d < R and All_NPs_Compatible(ci, cj) is true, then merge the two classes

Algorithm: All_NPs_Compatible(ci, cj)
1. For each NPa in ci and NPb in cj, compute dist(NPa, NPb).
2. If there exists a dist(NPa, NPb) = ∞, return false.
3. Return true.

dist(NPi, NPj) = Σ_{f ∈ F} w_f × incompatibility_f(NPi, NPj)
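A compact rendering of the clustering loop (simplified: it takes a distance function directly and returns index sets, whereas the full algorithm computes the weighted incompatibility distance over the feature table):

```python
def coref_cluster(nps, dist, R=1.0):
    """Greedy clustering: a pair within radius R triggers a merge, but only
    if no cross-pair between the two clusters is incompatible (infinite
    distance), mirroring All_NPs_Compatible."""
    clusters = [{i} for i in range(len(nps))]

    def find(i):
        return next(c for c in clusters if i in c)

    for j in range(len(nps) - 1, 0, -1):      # last NP to first, as above
        for i in range(j - 1, -1, -1):
            ci, cj = find(i), find(j)
            if ci is cj:
                continue
            if dist(nps[i], nps[j]) < R and all(
                    dist(nps[a], nps[b]) != float("inf")
                    for a in ci for b in cj):
                clusters.remove(ci)
                clusters.remove(cj)
                clusters.append(ci | cj)
    return clusters
```

With a toy distance (0 for identical strings, infinity otherwise), the two occurrences of 他 cluster together while 張三 stays a singleton; the full distance function would link 張三 and 他 through the pronoun and semantics features.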


feature f     w_f   incompatibility function (assume NPi before NPj)
string        10    total differing characters ÷ length of the maximal string; if the head is a pronoun, 0 is assigned
head          1     if the heads are different, 1 is assigned; else 0
position      5     relative position of the two terms in sentences ÷ total sentences + relative position in paragraphs ÷ total paragraphs
pronoun       R     if the head of NPi is a pronoun but the head of NPj is not, 1 is assigned; else 0
substring     -∞    if neither head is a pronoun and the head of NPj is a substring of the head of NPi, 1 is assigned; else 0
head is NE    -∞    if the heads denote the same named entity, 1 is assigned; else 0
synonym       -∞    if the heads belong to the same word cluster, 1 is assigned; else 0
NP modifier   ∞     if NPi is a modifier of NPj, 1 is assigned; else 0
proper name   ∞     if both terms are proper names of different types, 1 is assigned; if they are of the same type but share no common character, 1 is assigned; else 0
gender        ∞     if the genders differ, 1 is assigned; else 0
number        ∞     if the numbers differ, 1 is assigned; else 0
semantics     ∞     s, where s = SemanticFun_1(NPi, NPj) or s = SemanticFun_2(NPi, NPj)


Hsin-Hsi Chen (NTU) 107

Semantics Restrictions

• SemanticFun_1(NPi, NPj)
– If the heads belong to the same word cluster, 0 is assigned; else 1.

• SemanticFun_2(NPi, NPj)
– Integrates POS, named entity type, and Cilin sense. The cases are:
• only one of NPi and NPj is a named entity
• neither is a named entity, and both are in Cilin
• neither is a named entity, and only one of them is in Cilin
• neither is a named entity, and both are outside Cilin → 0


Hsin-Hsi Chen (NTU) 108

SemanticFun_2
– One of them is an NE
• Columns denote the type of NE
• Rows denote the part of speech
• An English string in a table cell denotes compatible Cilin senses

                     person  organization  location        time  date  money   percentage
Na                   A       Di, Dm, Bn    Bd, Be, Bn, Cb  Ca    Ca    Dj, Dn  Dn
Nb                   A       1             Bd, Be, Bn, Cb  1     1     1       1
Nc                   1       Di, Dm, Bn    Bd, Be, Bn, Cb  1     1     1       1
Nd                   1       1             1               Ca    Ca    1       1
Nh (personal)        0       1             1               1     1     1       1
Nh (generic)         0       0             0               0     0     0       0
Nh (thing-denoting)  1       0             0               1     1     1       1


Experimental Results

Run  Configuration                               Recall             Precision           F
18   All cues (R=1.0)       3253  4441  5486/9745 (56.3)  5486/14684 (37.4)  44.91
7    Pattern matching       3238  5283  5466/9743 (56.1)  5466/14696 (37.2)  44.73
30   Delete sub-string      3224  3651  4191/9693 (43.2)  4191/10985 (38.2)  40.54
31   Delete head=NE         3253  4441  5486/9745 (56.3)  5486/14684 (37.4)  44.91
32   Delete synonym         3252  4372  5414/9738 (55.6)  5414/14162 (38.2)  45.31
33   Delete NP modifier     3253  4434  5493/9745 (56.4)  5493/14705 (37.4)  44.93
34   Delete proper nouns    3254  4424  5490/9746 (56.3)  5490/14840 (37.0)  44.66
35   Delete gender          3253  4441  5486/9745 (56.3)  5486/14695 (37.3)  44.89
36   Delete number          3253  4436  5496/9746 (56.4)  5496/14784 (37.2)  44.81
37   Delete semantics       3255  4625  5679/9752 (58.2)  5679/16198 (35.1)  43.77
39   Delete synonym + NP modifier  3249  4134  5783/9746 (59.3)  5783/13353 (43.3)  50.07


Hsin-Hsi Chen (NTU) 110

Named Entity Tagging Environment


Hsin-Hsi Chen (NTU) 111


Hsin-Hsi Chen (NTU) 112


Hsin-Hsi Chen (NTU) 113


Hsin-Hsi Chen (NTU) 114

tag at the same time


Hsin-Hsi Chen (NTU) 115


Hsin-Hsi Chen (NTU) 116

勝利 vs. 勝利工業公司


Hsin-Hsi Chen (NTU) 117


Hsin-Hsi Chen (NTU) 118

Named Entity Extraction in Bioinformatics Application

• Named Entities– Protein Name– Gene Name– …


Hsin-Hsi Chen (NTU) 119

Summary

• Segmentation

• Named Entity Extraction

• POS and Sense Tagging

• Co-Reference Resolution

• NE Tagging Environment

• Bioinformatics