SSIMIP-2002 (July 11) Chapter 13 Chinese Information Extraction Technologies Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: [email protected] 臺灣大學資訊工程學研究所自然語言處理實驗室 Natural Language Processing Lab. National Taiwan University




Outline

• Introduction to Information Extraction (IE)

• Chinese IE Technologies

• Tagging Environment for Chinese IE

• Applications

• Summary




Introduction

• Information Extraction – the extraction or pulling out of pertinent information from large volumes of text

• Information Extraction System – an automated system to extract pertinent information from large volumes of text

• Information Extraction Technologies – techniques used to automatically extract specified information from text

(http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)


An Example in Air Vehicle Launch

• Original Document

• Named-Entity-Tagged Document

• Equivalence Classes

• Co-Reference Tagged Document


<DOC><DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> 中國時報 </DOCSRC><TEXT>【本報綜合紐約、華盛頓十一日外電報導】在華盛頓宣布首度出售「刺針」肩射防空飛彈 給南韓的第二天,美國與北韓今天在紐約恢復延擱已久的會談,這項預定三天的會談將以北韓的飛彈發展為重點,包括北韓準備部署射程 可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈 的報導。

 美國國務院發言人柏恩斯說:「在有關北韓 飛彈擴散問題上,美方的確有多項關切之處。」美國官員也長期懷疑北韓正對伊朗和敘利亞輸出飛彈,並希望平壤加入禁止擴散此種武器的

red: location name; blue: date expression; green: organization name; purple: person name


國際公約。美國官員已知會北韓說,倘若北韓希望與美國建立正常的外交關係,就必須減少飛彈輸出。

 這項有關北韓飛彈計劃的會談是雙方於一九九六年四月在德國柏林舉行的首度會談的後續談判。美國在該次會談中要求北韓停止生產、測試及出售飛彈給他國,尤其是敘利亞和伊朗兩國。

 美國副助理國務卿艾恩宏和北韓外交部對外事務局局長李衡哲分別為雙方的談判代表,會談預定在十三日結束。

 柏恩斯說:「美方非常關心所有北韓本身,或是北韓與中共、伊朗或其他國家的飛彈問題。我們認為就此與他們舉行會談是甚為重要。」

 而為提昇南韓陸軍的自衛能力,美國於昨天宣布準備出售價值三億零七百萬美元的一千零六十五枚刺針飛彈與其他武器給南韓,它說,這項交易不會使朝鮮半島的緊張局勢惡化。


五角大廈說:「這項設備與支援的銷售不會影響該區基本軍事均勢。」

 國務院也表示全力支持此項包含兩百一十三座發射台、支援設備、零件與訓練的交易。

 柏恩斯說:「這項交易獲得政府內每一個人的全力支持,它符合我們在朝鮮半島的政策。」他強調:「我們的第一優先是防衛南韓。」

 如果國會同意,這將是華府對南韓出售防空 飛彈的第一筆交易。

</TEXT></DOC>


<DOC><DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> 中國時報 </DOCSRC><ISRELEVANT> NO </ISRELEVANT><TITLE> <ENAMEX TYPE="LOCATION"> 美 </ENAMEX>擬售 <ENAMEX TYPE="LOCATION"> 南韓 </ENAMEX>1065 枚刺針飛彈 </TITLE><TEXT>

【本報綜合 <ENAMEX TYPE="LOCATION"> 紐約 </ENAMEX> 、 <ENAMEX TYPE="LOCATION"> 華盛頓 </ENAMEX><TIMEX TYPE="DATE"> 十一日 </TIMEX> 外電報導】在 <ENAMEX TYPE="LOCATION"> 華盛頓 </ENAMEX> 宣布首度出售「刺針」肩射防空飛彈 給 <ENAMEX TYPE="LOCATION"> 南韓 </ENAMEX> 的 <TIMEX TYPE="DATE"> 第二天 </TIMEX> , <ENAMEX TYPE="LOCATION"> 美國 </ENAMEX> 與 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX><TIMEX TYPE="DATE"> 今天 </TIMEX> 在<ENAMEX TYPE="LOCATION"> 紐約 </ENAMEX> 恢復延擱已久的會談,這項預定三天的會談將以 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX> 的飛彈發展為重點,包括 <ENAMEX TYPE="LOCATION"> 北韓 </ENAMEX> 準備部署射程 可涵蓋幾乎 <ENAMEX TYPE="LOCATION"> 日本 </ENAMEX> 全境的「蘆洞」一號長程飛彈 的報導。


<ID="3"> 十一日 <ID="4" REF="3"> 今天 <ID="5" REF="3"> 出售「刺針」肩射防空飛彈 給南韓的第二天

<ID="63"> 延擱已久的會談 <ID="66" REF="63"> 一九九六年四月在德國柏林舉行的首度會談的後續談判 <ID="65" REF="63"> 這項有關北韓飛彈計劃的會談 <ID="70" REF="65"> 會談 <ID="69" REF="65"> 會談 <ID="64" REF="63"> 這項預定三天的會談


<DOC> <DOCID> NTU-AIR_LAUNCH- 中國時報 -19970612-002 </DOCID><DATASET> Air Vehicle Launch </DATASET><DD> 1997/06/12 </DD><DOCTYPE> 報紙報導 </DOCTYPE><DOCSRC> <COREF ID="1"> 中國時報 </COREF> </DOCSRC><ISRELEVANT> NO </ISRELEVANT><TITLE> <COREF ID="6"> 美 </COREF>擬售 <COREF ID="23"> 南韓 </COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN=" 刺針飛彈 ">1065 枚刺針飛彈 </COREF> </TITLE><TEXT>

【 <COREF ID="2" REF="1" TYPE="IDENT"> 本報 </COREF> 綜合 <COREF ID="61"> 紐約 </COREF> 、 <COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT"> 華盛頓 </COREF><COREF ID="3">十一日 </COREF> 外電報導】在 <COREF ID="7" REF="6" TYPE="IDENT"> 華盛頓 </COREF> 宣布首度 <COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN=" 第二天 "> 出售「刺針」肩射防空飛彈 給 <COREF ID="24" REF="23" TYPE="IDENT"> 南韓 </COREF> 的第二天 </COREF> , <COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT"> 美國 </COREF> 與 <COREF ID="29"> 北韓 </COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT"> 今天 </COREF> 在 <COREF ID="62" REF="61" TYPE="IDENT"> 紐約 </COREF> 恢復 <COREF ID="63" MIN=" 會談 "> 延擱已久的會談 </COREF> , <COREF ID="64" REF="63" TYPE="IDENT" MIN=" 會談 "> 這項預定三天的會談 </COREF> 將以 <COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN=" 飛彈 "><COREF ID="30" REF="29" TYPE="IDENT"> 北韓 </COREF> 的飛彈 </COREF> 發展為重點,包括 <COREF ID="31" REF="29" TYPE="IDENT"> 北韓 </COREF> 準備部署射程 可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈 的報導。


IE Evaluation in MUC-7 (1998)

• Named Entity Task [NE]: Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure

• Multi-lingual Entity Task [MET]: the NE task for Chinese and Japanese

• Co-reference Task [CO]: Capture information on co-referring expressions: all mentions of a given entity, including those tagged in the NE and TE tasks


IE Evaluation in MUC-7 (cont.)

• Template Element Task [TE]: Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text

• Template Relation Task [TR]: Extract relational information on employee_of, manufacture_of, and location_of relations

• Scenario Template Task [ST]: Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.


Chinese IE Technologies

• Segmentation

• Named Entity Extraction

• Part of Speech/Sense Tagging

• Full/Partial Parsing

• Co-Reference Resolution




Segmentation

• Problem
– A Chinese sentence is composed of characters without word boundaries
– 這名記者會說國語。
• 這 名 記者 會 說 國語。
• 這 名 記者會 說 國語。

• Word Definition
– A character string with an independent meaning and a specific syntactic function


Segmentation

• Standard
– China 【信息處理用現代漢語分詞規範】
• Implemented in 1988
• National standard in 1992 (GB/T13715-92)
– Taiwan 【資訊處理用中文分詞標準草案】
• Proposed by ROCLING in 1996
• National standard in 1999 (CNS14366)


Segmentation Strategies

• Dictionary is an important resource
– List "all" possible words
– Find the most "plausible" path from a word lattice
– 把他的確實行動作了分析
– 電子計算機是會計算題目的機器

電 子 計 算 機 是 會 計 算 題 目 的 機 器

把 他 的 確 實 行 動 作 了 分 析


Segmentation Strategies (Continued)

• Disambiguation: Select the best combination
– Rule-based
• Longest-word first: 台灣大學 是 有名 的 學府
• 長詞遮蔽短詞 (a long word masks shorter words): *這 名 記者會 說 國語。
• Delete the discontinuous fragments
• Other heuristic rules: 2-3 word preference, ...
• Parser
– Statistics-based
• Markov models, relaxation method, and so on
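The longest-word-first heuristic above can be sketched in a few lines. This is a minimal illustration over a toy lexicon (my own, not the lecture's data); it also reproduces the masking failure noted above, where 記者會 hides 記者 + 會:

```python
def segment_longest_match(sentence, lexicon):
    """Greedy longest-word-first segmentation; characters not covered by
    any lexicon word fall back to single-character tokens."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest possible word starting at position i first.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"台灣大學", "台灣", "大學", "有名", "學府", "記者會", "記者", "國語"}
print(segment_longest_match("台灣大學是有名的學府", lexicon))
# ['台灣大學', '是', '有名', '的', '學府']
print(segment_longest_match("這名記者會說國語", lexicon))
# ['這', '名', '記者會', '說', '國語'] -- the long word masks 記者 + 會
```

The second result is exactly the starred (incorrect) reading on the slide: a greedy rule cannot recover 記者 + 會 once 記者會 matches, which is why the statistics-based methods below are needed.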


Segmentation Strategies

• Dictionary Coverage
– A dictionary cannot cover "all" the words
– Solutions
• Morphological rules
• (Semi-)automatic construction of dictionaries: automatic terminology extraction
• Unknown word resolution


Morphological Rules

• numeral + classifier + classifier – 一個個, 一條條

• date + time–八十五年十月四日

• noun (or verb) prefix/suffix– 學生們

• special verbs–丟丟 看,吃吃 看,寫寫 看–高高興興,歡歡喜喜,漂漂亮亮,迷迷糊糊–打打球,跑跑步,寫寫字

• ...
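Several of the reduplication patterns above are easy to express as surface rules. A minimal sketch (my own illustration, not the lecture's implementation) using regular expressions with backreferences for three of them:

```python
import re

# Rule names and character classes here are illustrative only.
RULES = {
    "numeral+classifier+classifier": re.compile(r"[一二三四五六七八九十]([個條張枚])\1"),  # 一個個
    "AABB":                          re.compile(r"(.)\1(.)\2"),                            # 高高興興
    "VV+看":                         re.compile(r"(.)\1看"),                               # 吃吃看
}

def match_morph_rules(token):
    """Return the names of the reduplication rules the whole token matches."""
    return [name for name, pat in RULES.items() if pat.fullmatch(token)]

print(match_morph_rules("一個個"))    # ['numeral+classifier+classifier']
print(match_morph_rules("高高興興"))  # ['AABB']
print(match_morph_rules("吃吃看"))    # ['VV+看']
```

In a real segmenter such rules would run over unmatched character sequences after dictionary look-up, promoting rule matches to word candidates.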


Term Extraction: n-gram Approach

• Compute n-grams from a corpus
• Select candidate terms
– Successor variety
• the successor variety will sharply increase until a segment boundary is reached
• use i-grams and (i+1)-grams to select candidate terms of length i
– Mutual Information: MI(x, y) = log2 ( P(x, y) / ( P(x) P(y) ) )
– Significance Estimation Function: SE(c) = f_c / (f_a + f_b - f_c), where c = c1 c2 ... cn, a = c1 c2 ... cn-1, b = c2 c3 ... cn, and f_a, f_b, f_c are their frequencies
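Both the mutual-information and significance-estimation measures can be computed directly from raw character n-gram counts. A sketch with simple relative-frequency estimates (the tiny "corpus" is just the example sentence from the segmentation slides, so the numbers are illustrative only):

```python
from collections import Counter
from math import log2

def ngram_counts(corpus, n):
    """Count all character n-grams of length n in a list of strings."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def mutual_information(corpus, x, y):
    """MI(x, y) = log2 P(xy) / (P(x) P(y)) for adjacent characters x, y."""
    uni, big = ngram_counts(corpus, 1), ngram_counts(corpus, 2)
    p_xy = big[x + y] / sum(big.values())
    p_x = uni[x] / sum(uni.values())
    p_y = uni[y] / sum(uni.values())
    return log2(p_xy / (p_x * p_y))

def significance(corpus, c):
    """SE(c) = f_c / (f_a + f_b - f_c), where a and b are the two
    (n-1)-gram halves of the n-gram c."""
    f_c = ngram_counts(corpus, len(c))[c]
    sub = ngram_counts(corpus, len(c) - 1)
    return f_c / (sub[c[:-1]] + sub[c[1:]] - f_c)

corpus = ["電子計算機是會計算題目的機器"]
print(significance(corpus, "計算機"))  # 1 / (2 + 1 - 1) = 0.5
print(round(mutual_information(corpus, "計", "算"), 2))  # 2.91
```

A high SE value means the n-gram occurs almost every time its two halves do, which is the signature of a lexical unit; on a realistic corpus the counts would come from millions of characters rather than one sentence.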




Named Entity Extraction

• Five basic components in a document
– People, affairs, time, places, things
– Major unknown words

• Named Entities in MET-2
– Names: people, organizations, locations
– Numbers: monetary/percentage expressions
– Time: date/time expressions


Named People Extraction

• Chinese person names
– Chinese person names are composed of surnames and names.
– Most Chinese surnames are a single character; some rare ones are two characters.
– Most names are two characters; some rare ones are a single character (in Taiwan).
– The length of Chinese person names ranges from 2 to 6 characters.

• Transliterated person names
– Transliterated person names denote foreigners.
– The length of transliterated person names is not restricted to 2 to 6 characters.


Named People Extraction: Chinese Person Names

• Extraction Strategies
– Baseline models: name-formulation statistics
• Propose possible candidates.
– Context cues
• Add extra scores to the candidates.
• When a title appears before (after) a string, it is probably a person name.
• Person names usually appear at the head or the tail of a sentence.
• Persons may be accompanied by speech-act verbs like "發言", "說", "提出", etc.
– Cache: occurrences of named people
• A candidate appearing more than once has a high tendency to be a person name.


Structure of Chinese Personal Names

• Chinese surnames have the following three types
– Single character like '趙', '錢', '孫', '李'
– Two characters like '歐陽' and '上官'
– Two surnames together like '蔣宋'

• Most names have the following two types
– Single character
– Two characters


Training Data

• Name-formulation statistics are trained from a 1-million person name corpus from Taiwan.
• Each entry contains surname, name, and sex.
• There are 489,305 male names and 509,110 female names.
• In total, 598 surnames are retrieved from this 1-M corpus.
• The surnames of very low frequency, like "是", "那", etc., are removed to avoid false alarms.
• Only 541 surnames are left; these are used to trigger the person name extraction system.


Training Data

• The probability of a Chinese character being the first character (or the second character) of a name is computed for males and females separately.
• We compute the probabilities using training tables for female and male names, respectively.
• Either the male score or the female score may be greater than the thresholds.
• In some cases, the female score may be greater than the male score.
• Thresholds are defined so that 99% of the training data pass them.


Baseline Models: name-formulation statistics

• Model 1. Single-character surname, e.g., '趙', '錢', '孫', and '李'
– P(C1)*P(C2)*P(C3) using the training table for males > Threshold1 and P(C2)*P(C3) using the training table for males > Threshold2, or
– P(C1)*P(C2)*P(C3) using the training table for females > Threshold3 and P(C2)*P(C3) using the training table for females > Threshold4

• Model 2. Two-character surname, e.g., '歐陽' and '上官'
– P(C2)*P(C3) using the training table for males > Threshold2, or
– P(C2)*P(C3) using the training table for females > Threshold4

• Model 3. Two surnames together, like '蔣宋'
– P(C12)*P(C2)*P(C3) using the training table for females > Threshold3, P(C2)*P(C3) using the training table for females > Threshold4, and P(C12)*P(C2)*P(C3) using the training table for females > P(C12)*P(C2)*P(C3) using the training table for males
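Model 1 can be sketched as a threshold test over character probabilities. Every probability and threshold below is made up for illustration; the real system estimates them from the 1M-name corpus with separate male/female tables:

```python
# Hypothetical tables; the actual values come from the training corpus.
P_SURNAME = {"陳": 0.05, "李": 0.04, "趙": 0.01}
P_FIRST = {"水": 0.004, "家": 0.006, "確": 1e-6}  # P(c as first name character)
P_LAST  = {"扁": 0.002, "明": 0.008, "實": 1e-6}  # P(c as second name character)
THRESHOLD1 = 1e-8  # on P(C1)*P(C2)*P(C3)
THRESHOLD2 = 1e-7  # on P(C2)*P(C3)

def is_name_candidate(c1, c2, c3):
    """Model 1 sketch (single-character surname), one training table only;
    the full model repeats this test with the male and female tables."""
    if c1 not in P_SURNAME:          # surname characters trigger the system
        return False
    name_score = P_FIRST.get(c2, 0.0) * P_LAST.get(c3, 0.0)
    full_score = P_SURNAME[c1] * name_score
    return full_score > THRESHOLD1 and name_score > THRESHOLD2

print(is_name_candidate("陳", "水", "扁"))  # True
print(is_name_candidate("陳", "確", "實"))  # False: name characters too unlikely
```

The second example is the 的確實-style false alarm the context cues below are designed to catch: the surname test alone cannot tell 陳確實 (a sentence fragment) from a real name.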


Cues from the Character Level

• Gender
– A married woman may add her husband's surname before her own surname. That forms type 3 person names.
– Because a surname character may also be used as a name character, candidates with two surnames do not always belong to the type 3 person names.
– The gender information helps us disambiguate this type of person name.
– Some Chinese characters have a high score for males and some for females. The following shows some examples.
– Male: 豪、霸、宏、志、斌、彬、強、正、昌、輝、雄
– Female: 佩、月、玉、如、君、秀、佳、怡、芬、芳、女


Cues from the Sentence Level

• Titles
– When a title appears before (after) a candidate, it is probably a person name. Titles also help to decide the boundary of a name.
– 總統陳水扁 vs. 總統向青年學子...

• Mutual Information
– Telling whether a string is a content word or a name is indispensable.
– 陳家世清白,決不會犯法。
– When there exists a strong relationship between the surrounding words, the candidate has a high probability of being a content word.

• Punctuation Marks
– When a candidate is located at the end of a sentence, we give it an extra score.
– Words around a caesura mark (、) tend to have similar types.


Cues from Passage/Document Level: Cache

• A person name may appear more than once in a paragraph.
• There are four cases when the cache is used.
– (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct.
– (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct.
– (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct.
– (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.


Cache

• The problem in using the cache is case selection.
• For every entry in the cache, we assign a weight.
– An entry with a clear right boundary has a high weight.
• title and punctuation
– The other entries are assigned a low weight.
• The use of weights in case selection
– high vs. high ==> case (2)
– high vs. low or low vs. high ==> the high-weight entry is correct
– low vs. low
• check the score of the last character of the name part
• 邱永漢 邱永強
• 李鵬常 李鵬及


Discussion

• Some typical types of errors
– Foreign names (e.g., 魏斯特, 艾琳達)
• They are identified as proper nouns correctly, but are assigned wrong features.
• About 20% of errors belong to this type.
– Rare surnames (e.g., 應, 伊, 鳳) or artists' stage names
• Nearly 14% of errors come from this type.
– Others
• Other proper nouns (place names, organization names, etc.)
• Identification errors


Omitted Name Problem

• Some texts omit the name part and leave only the surname.
– 陳踢了王一腳

• Strategies
– If the candidate appeared earlier in the same paragraph, it is an omitted name.
– If the candidate has a special title like "嫌、妻、老、女" or a general title like "立委、教授、 ...", then it is an omitted name.
– If two single characters have a very high probability of being surnames, and they appear around a caesura mark, then they are regarded as omitted names.


Transliterated Person Names

• Challenging Issues
– No special cue, like the surnames in Chinese person names, to trigger the recognition system.
– No restriction on the length of a transliterated person name.
– No large-scale transliterated person name corpus.
– Ambiguity in classification: '華盛頓' may denote a city or a former American president.


Strategy (1)

• Character Condition
– When a foreign name is transliterated, the selection of homophones is restricted. Richard Marx: 理查馬克斯 vs. 娌茶碼剋鷥
– The basic character set can be trained from a transliterated name corpus.
– If all the characters in a string belong to this set, the string is regarded as a candidate.


Strategy (2)

• Syllable Condition
– Some characters which meet the character condition do not look like transliterated names.
– Syllable Sequence
– Simplified Condition
• (1) For each candidate, check the syllable of the first (the last) character.
• (2) If the syllable does not occur in the training corpus, the character is deleted.
• (3) The remaining characters are treated in the same way.


Strategy (3)

• Frequency Condition
– For each candidate which has only two characters, compute the frequency of these two characters to see if it is larger than a threshold.
– The threshold is determined in a similar way as in the baseline model for Chinese person names.


Cues around Names

• Cues within Transliterated Names
– Character Condition
– Syllable Condition
– Frequency Condition

• Cues around Transliterated Names
– titles: the same as for Chinese person names
– name introducers: "叫", "叫作", "叫做", "名叫", and "尊稱"
– special verbs: the same as for Chinese person names
– pattern: first name ․ middle name ․ last name


Discussion

• Some transliterated person names may be identified by the Chinese person name extraction system.
– 魏斯特 愛琳達
• Some nouns may look like transliterated person names.
– popular brands of automobiles, e.g., '飛雅特' and '雪佛蘭'
– Chinese proper nouns, e.g., '利多', '連拉' and '華隆'
– Chinese person names, e.g., '朱士列'
• Besides the above nouns, boundary errors affect the precision too.
– (拉)瑞強森


Named Organization Extraction

• A complete organization name can be divided into two parts: name and keyword.
– Example: 台北市政府
– Many words can serve as names, but only some fixed words can serve as keywords.
– Challenging Issues
• (1) a keyword is usually a common content word
• (2) a keyword may appear in an abbreviated form
• (3) the keyword may be omitted completely


Classification of Organization Names

• Complete organization names
– This type of organization name is usually composed of proper nouns and keywords.
– Some organization names are very long, so (left) boundary determination is difficult.
– Some organization names with keywords are still ambiguous.
• '聯合報' usually denotes reading matter, not an organization.

• Incomplete organization names
– These organization names often omit their keywords.
– Abbreviated organization names may be ambiguous.
– '兄弟' and '公牛' are famous sport teams in Taiwan and in the USA, respectively; however, they are also common content words.


Strategies

• Keywords
– A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary.

• Prefix
– A prefix is a good marker for a possible left boundary.

• Single-character words
– If the character preceding a possible keyword is a single-character word, then the content word is not a keyword.
– If the characters preceding a possible keyword cannot exist independently, they form the name part of an organization.

• Words of at least two characters
– The words that compose a name part usually have strong relationships.


Strategies

• Parts of speech
– The name part of an organization cannot extend beyond a transitive verb.
– Numerals and classifiers are also helpful.

• Cache
– Problem: when should a pattern be put into the cache?
– The character set is incomplete.

• n-gram model
– The candidate must consist of a name and an organization name keyword.
– Its length must be greater than 2 words.
– It must not cross any punctuation marks.
– It must occur more often than a threshold.
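The four n-gram conditions translate directly into a candidate filter. A sketch (keyword set, punctuation class, and threshold are illustrative, not the system's actual values):

```python
import re

def ngram_org_candidate_ok(tokens, keywords, freq, threshold=2):
    """Check an n-gram candidate against the four conditions above.
    tokens: the candidate as a list of segmented words;
    keywords: set of organization-name keywords; freq: corpus frequency."""
    ends_in_keyword = tokens[-1] in keywords            # name + keyword
    long_enough = len(tokens) > 2                       # more than 2 words
    no_punct = not re.search(r"[,、。;:!?]", "".join(tokens))  # no punctuation inside
    frequent = freq > threshold                         # occurs often enough
    return ends_in_keyword and long_enough and no_punct and frequent

print(ngram_org_candidate_ok(["羅慧夫", "文教", "基金會"], {"基金會"}, freq=5))  # True
print(ngram_org_candidate_ok(["台北", "市政府"], {"市政府"}, freq=5))  # False: only 2 words
```

In the full system, candidates passing this filter are the ones the handcrafted rules on the next slide try to analyze into name and keyword parts.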


Handcrafted Rules

– OrganizationName → OrganizationName OrganizationNameKeyword
e.g., 聯合國 部隊
– OrganizationName → CountryName OrganizationNameKeyword
e.g., 美國 大使館
– OrganizationName → PersonName OrganizationNameKeyword
e.g., 羅慧夫 基金會
– OrganizationName → CountryName OrganizationName
e.g., 美國 國防部
– OrganizationName → LocationName OrganizationName
e.g., 伊利諾州 州府
– OrganizationName → CountryName {D|DD} OrganizationNameKeyword
e.g., 中國 國際 廣播電台
– OrganizationName → PersonName {D|DD} OrganizationNameKeyword
e.g., 羅慧夫 文教 基金會
– OrganizationName → LocationName {D|DD} OrganizationNameKeyword
e.g., 台北 國際 廣播電台


Discussion

• Most errors result from organization names without keywords.
– 金匯通 復華 大公投顧
– 兄弟 太陽 烈火

• Identification errors
– Even if keywords appear, organization names do not always exist.
• 上市公司 各國大學
– Erroneous left boundaries are also a problem.
• 不為國安局 (基督)長老教會

• Ambiguities
– 聯合報 天下雜誌


Application of Gender Assignment

• Anaphora resolution" 問華德教授,他說那是正常的師生戀,既然雙方都是獨身男女,總不會不准談戀愛吧。至於後來趙靜雯去了那裡,為甚麼失蹤,他一概不知,並輕描淡寫的說:「也許加拿大不適合她,跑回臺灣去了。」 "

– The gender of a person name is useful for this problem.
– The correct rate for gender assignment is 89%.

• Co-reference resolution


Named Location Extraction

• A location name is composed of name and keyword parts.
• Rules
– LocationName → PersonName LocationNameKeyword
– LocationName → LocationName LocationNameKeyword
• Locative verbs like '來自', '前往', and so on, are introduced to treat location names without keywords.
• Cache and n-gram models are also employed to extract location names.


Date Expressions

• DATE → NUMBER YEAR (三 年)
• DATE → NUMBER MTHUNIT (十 月)
• DATE → NUMBER DUNIT (五 日)
• DATE → REGINC (元旦)
• DATE → FSTATE DATE (今年 三月)
• DATE → COMMON DATE (前 兩年)
• DATE → REGINE DATE (民國 七十八年)
• DATE → DATE DMONTH (今年 三月)
• DATE → DATE BSTATE (去年 初)
• DATE → FSTATEDATE DATE (這年 三月底)
• DATE → FSTATEDATE DMONTH (今年 元月)
• DATE → FSTATEDATE FSTATEDATE (明年 今天)
• DATE → DATE YXY DATE (去年三月 至 今年五月)
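The non-recursive productions of this grammar can be approximated with regular expressions; a sketch (my own rendering, covering only a handful of the rules above, with an illustrative numeral class):

```python
import re

NUM = r"[〇零一二三四五六七八九十百千]+"  # Chinese numeral characters (illustrative)

# A few of the DATE productions rendered as regexes; the full grammar is
# recursive (DATE -> FSTATE DATE, DATE -> DATE YXY DATE, ...), which a flat
# regex list cannot capture.
DATE_PATTERNS = [
    re.compile(NUM + r"年"),                 # DATE -> NUMBER YEAR    e.g. 三年
    re.compile(NUM + r"月"),                 # DATE -> NUMBER MTHUNIT e.g. 十月
    re.compile(NUM + r"日"),                 # DATE -> NUMBER DUNIT   e.g. 五日
    re.compile(r"(今年|去年|明年|前年)"),     # some FSTATE dates
]

def find_dates(text):
    """Collect all spans matched by any of the date patterns."""
    spans = []
    for pat in DATE_PATTERNS:
        spans.extend(m.group() for m in pat.finditer(text))
    return spans

print(find_dates("民國七十八年三月五日"))  # ['七十八年', '三月', '五日']
```

A real implementation would merge adjacent matches bottom-up (七十八年 + 三月 + 五日 → one DATE) following the recursive productions, rather than report them separately.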


Time Expressions

• TIME → NUMBER HUNIT (五 時)
• TIME → NUMBER MUNIT (三十 分)
• TIME → NUMBER SUNIT (六 秒)
• TIME → FSTATETIME TIME
• TIME → FSTATE TIME
• TIME → TIME BSTATE
• TIME → MORN BSTATE
• TIME → TIME TIME
• TIME → TIME YXY TIME (今天 到 明天)
• TIME → NUMBER COLON NUMBER (03:45)


Monetary Expressions

• DMONEY → MOUNIT NUMBER MOUNIT (美金 五 元)
• DMONEY → NUMBER MOUNIT MOUNIT (五 元 美金)
• DMONEY → NUMBER MOUNIT (五 元)
• DMONEY → MOUNIT MOUNIT NUMBER (美金 $ 5)
• DMONEY → MOUNIT NUMBER ($ 5)
• DMONEY → NUMBER YXY DMONEY (三 至 五元)
• DMONEY → DMONEY YXY DMONEY (三元 至 五元)
• DMONEY → DMONEY YXY NUMBER ($200 - 500)


Percentage Expressions

• DPERCENT → PERCENT NUMBER (百分之 十)
• DPERCENT → NUMBER PERCENT (3%)
• DPERCENT → DPERCENT YXY DPERCENT (5% 到 8%)
• DPERCENT → DPERCENT YXY NUMBER (百分之八 到 十)
• DPERCENT → NUMBER YXY DPERCENT (八 到 十百分點)


Named Entity Extraction in MET-2

• Transform Chinese texts in GB codes into texts in Big-5 codes.
• Segment Chinese texts into a sequence of tokens.
• Identify named people.
• Identify named organizations.
• Identify named locations.
• Use the n-gram model to identify named organizations/locations.
• Identify the rest of the named expressions.
• Transform the results in Big-5 codes back into GB codes.


From GB Codes to Big-5 Codes

• The Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively.
• Our system is developed on the basis of Big-5 codes, so the transformation is required.
• Characters used both in the simplified character set and in the traditional character set always result in erroneous mappings:
– 旅遊 vs. 旅游, 報導 vs. 報道, 最後 vs. 最后, 那麼 vs. 那么, 準確 vs. 准確, 並不是 vs. 并不是, 幾十年 vs. 几十年, 好像 vs. 好象, 由於 vs. 由于, 長時間裡 vs. 長時間里, and so on.
• More unknown words may be generated.
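The conversion step can be sketched with Python's built-in gb2312 and big5 codecs (an assumption of this sketch; the original system predates them). Such a conversion is character-by-character, which is exactly the problem above: simplified 旅游 stays 旅游 rather than becoming the traditional lexical form 旅遊.

```python
def gb_bytes_to_big5_bytes(gb_bytes: bytes) -> bytes:
    """Character-level GB -> Big-5 conversion via Unicode.

    Characters with no Big-5 counterpart are replaced with '?', mirroring
    the unknown-word problem described on the slide.
    """
    text = gb_bytes.decode("gb2312")              # simplified-character text
    return text.encode("big5", errors="replace")  # unmappable -> '?'
```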


Hsin-Hsi Chen (NTU) 57

Segmentation

• We list all the possible words by dictionary lookup, and then resolve ambiguities by segmentation strategies.
• The test documents in MET-2 are selected from China newspapers.
• Our dictionary is trained from Taiwan corpora.
• Due to the different vocabulary sets, many more unknown words may be introduced: "人工智慧" vs. "人工智能", "軟體" vs. "軟件", "肯亞" vs. "肯尼亞", "紐西蘭" vs. "新西蘭", etc.
• The unknown words from different code sets and different vocabulary sets make named entity extraction more challenging.
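The dictionary-lookup step can be illustrated with forward maximum matching, a common baseline segmenter (a simplification: the system described here lists all candidate words and then applies separate disambiguation strategies):

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character (a potential unknown
    word) when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

With a Taiwan-trained dictionary, a mainland term like 人工智能 would fall apart into single characters here, which is how vocabulary mismatch creates unknown words.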


Hsin-Hsi Chen (NTU) 58

MET-2 Formal Run of NTUNLPL

• F-measures
– P&R: 79.61%
– 2P&R: 77.88%
– P&2R: 81.42%

• Recall and Precision
– name: (85%, 79%)
– number: (91%, 98%)
– time: (95%, 85%)


Hsin-Hsi Chen (NTU) 59

Named Persons

• The recall rate and the precision rate are 91% and 74%.
• Major errors
– Segmentation, e.g., 盛世良 -> 盛世 良: part of a person name may be regarded as a word during segmentation.
– Incomplete surname set, character set, and title set, e.g., 肖成林 , 卡拉 捷 耶夫 , 醫生 卡庫
– Blanks, e.g., 羅 俏: we cannot tell whether blanks exist in the original documents or are inserted by the segmentation system.
– Boundary errors
– Japanese names, e.g., 田中真紀子


Hsin-Hsi Chen (NTU) 60

Evaluation: Named Organizations

• The recall rate and the precision rate are 78% and 85%.
• Major errors
– More than two content words between the name and the keyword, e.g., 中國 衛星 發射 代理 公司
– Absence of keywords, e.g., 巴解法塔賀武裝
– Absence of the name part: the name part does not satisfy the character condition, e.g., 亞星公司
– N-gram errors, e.g., 安得拉邦東南部發射基地


Hsin-Hsi Chen (NTU) 61

Evaluation: Named Locations

• The recall rate and the precision rate are 78% and 69%.
• Character set
– The characters "鹿" and "島" in the string "鹿兒島縣" do not belong to our transliterated character set.
• Wrong keyword
– The character "部" is an organization keyword, so the string "菲律賓馬部" is mistakenly regarded as an organization name.
• Common content words
– Words such as "太陽", "土星", etc., are common content words; we do not give them special tags.
• Single-character locations
– Single-character locations such as "中", "日", and so on, are missed during recognition.


Hsin-Hsi Chen (NTU) 62

Evaluation: Time/Date Expressions

• The recall and precision rates for date, time, monetary, and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.

• Major errors
– Propagation errors
• segmentation before entity extraction, e.g., "迄今"
• named-people extraction before date expressions
– Absent date units
• the date unit does not appear, e.g., "一九九六"
• the date unit should appear, e.g., "九月十"


Hsin-Hsi Chen (NTU) 63

– Absent keywords
• Some keywords are not listed.
• E.g., "上午莫斯科時間 8 點 58 分" is divided into "上午" and "8 點 58 分".
– Rule coverage
• E.g., "今、明兩年"
– Ambiguity
• Some characters like "點" can be used in both time and monetary expressions. E.g., "十二點七七億美元" is divided into two parts: "十二點" and "七七億美元".
• The strings "十分" and "一時" are words, so in our pipelined model "九點十分" and "下午一時" will be missed.


Hsin-Hsi Chen (NTU) 64

Issues

• Deal with the errors propagated from the previous modules
– pipelining model vs. interleaving model

• Deal with the errors resulting from rule coverage
– handcrafted rules vs. learned rules

• Deal with the errors resulting from segmentation standards
– vocabulary set of Academia Sinica vs. Peking University


Hsin-Hsi Chen (NTU) 65

Pipelining Model

(Figure) input → segmentation (ambiguity resolution; only one result is passed on) → named entity extraction (named people, named locations, named organizations, numbers, date/time) → output.


Hsin-Hsi Chen (NTU) 66

Interleaving Model

(Figure) input → table lookup → {named people, named locations, named organizations, numbers, date/time} → ambiguity resolution → output.


Hsin-Hsi Chen (NTU) 67

An Example in Interleaving Model

王 國 偉 立 刻 作


Hsin-Hsi Chen (NTU) 68

Learning Rules vs. Hand-Crafted Rules

• Collect organization names.
• Extract patterns
– Cluster organization names based on keywords
– Assign features to name parts
– Employ the Teiresias algorithm to extract patterns (http://cbcsrv.watson.ibm.com/Tspd.html)


Hsin-Hsi Chen (NTU) 69

Teiresias Algorithm

• Patterns consist of words and a wild card (*), e.g.,

[ 台北 商業 協會 ]

[ 台北 餐飲 協會 ]

=> 台北 * 協會

• Parameter settings
– L: the least number of non-wild cards
– W: the maximum window length spanned by L non-wild cards
– T: confidence level, i.e., how many training instances a pattern must satisfy
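A toy rendering of the parameter semantics, assuming a simplified pattern language in which * matches exactly one token (real Teiresias wild cards are more general, and real pattern discovery enumerates candidates rather than just checking them):

```python
def matches(pattern, name):
    """pattern and name are token lists; '*' matches exactly one token
    (a simplification for illustration)."""
    return len(pattern) == len(name) and all(
        p == "*" or p == t for p, t in zip(pattern, name))

def is_valid(pattern, instances, L=2, W=3, T=2):
    """Check the <L, W> density constraint and the support threshold T:
    the pattern must have at least L literals, any L consecutive literals
    must span at most W positions, and at least T instances must match."""
    lits = [i for i, p in enumerate(pattern) if p != "*"]
    if len(lits) < L:
        return False
    if any(lits[i + L - 1] - lits[i] + 1 > W for i in range(len(lits) - L + 1)):
        return False
    return sum(matches(pattern, n) for n in instances) >= T

names = [["台北", "商業", "協會"], ["台北", "餐飲", "協會"]]
```

With the two training names above, the pattern 台北 * 協會 satisfies L=2, W=3, T=2, while 台北 * 學會 fails on support.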


Hsin-Hsi Chen (NTU) 70

Keyword Set

• Extracting the keyword set
– Input all the training instances (i.e., organization names) into the Teiresias algorithm.
– Let the confidence level be 5.
– Find all the patterns not ending with a wild card, e.g., ( * * 公司 ) ( * * 陣線 ) ( * 協會 )
– Regard the suffix of a pattern as a keyword, e.g., 基金會 聯誼會 處 會議 會社 司令部 總會 軍事委員會


Hsin-Hsi Chen (NTU) 71

Features of Patterns

• Types
– named entities
• named people
• named locations
• named organizations
• date expressions ( 三十一日國家地震救災委員會 )
• numbers ( 87水災臨時救災委員會 )
– common nouns


Hsin-Hsi Chen (NTU) 72

Tagging


Hsin-Hsi Chen (NTU) 73

Tagging

• Lexical level
– part-of-speech tagging
– named entity tagging
– sense tagging

• Syntactic level
– syntactic category (structure) tagging

• Discourse level
– anaphora-antecedent tagging
– co-reference tagging


Hsin-Hsi Chen (NTU) 74

Part-of-Speech Tagging

• Issues of tagging accuracy
– the amount of training data
– the granularity of the tag set
– the occurrences of unknown words, and so on

• Academia Sinica Balanced Corpus
– 5 million words
– 46 tags

• Language models, e.g., bigram, trigram, …
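A bigram language model for tagging is typically decoded with the Viterbi algorithm. The sketch below uses made-up transition and emission probabilities (not ASBC estimates) purely to illustrate the mechanics:

```python
import math

# Toy parameters: P(tag | previous tag) and P(word | tag). Invented values.
trans = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4,
         ("N", "V"): 0.7, ("N", "N"): 0.3,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "狗"): 0.5, ("V", "跑"): 0.5, ("N", "跑"): 0.01, ("V", "狗"): 0.01}

def viterbi(words, tags=("N", "V")):
    """Return the most probable tag sequence under the bigram model."""
    # best[t] = (log-prob of the best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(
            ((lp + math.log(trans[(prev, t)] * emit.get((t, w), 1e-9)), path + [t])
             for prev, (lp, path) in best.items()), key=lambda x: x[0])
            for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]
```

A trigram model is the same idea with states ranging over tag pairs rather than single tags.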


Hsin-Hsi Chen (NTU) 75

Sense Tagging

• Assign sense labels to words in a sentence.

• Sense Tagging Set
– tong2yi4ci2ci2lin2 (同義詞詞林 , Cilin)
– 12 large categories
– 94 middle categories
– 1,428 small categories
– 3,925 word clusters


Hsin-Hsi Chen (NTU) 76

A. People
Aa. collective names: 01 human being, the people, everybody; 02 I, we; 03 you; 04 he/she, they; 05 myself, others, someone; 06 who
Ab. people of all ages and both sexes: 01 a man, a woman, men and women; 02 an old person, an adult, the old and the young; 03 a teenager; 04 an infant, a child
Ac. posture: 01 a tall person, a dwarf; 02 a fat person, a thin person; 03 a beautiful woman, a handsome man


Hsin-Hsi Chen (NTU) 77

A. PERSON (人): Aa. general name (泛稱), Ab. people of all ages and both sexes (男女老少), Ac. posture (體態), Ad. nationality/citizenship (籍屬), Ae. occupation (職業), Af. identity (身分), Ag. situation (狀況), Ah. relative/family dependents (親人/眷屬), Ai. rank in the family (輩次), Aj. relationship (關係), Ak. morality (品行), Al. ability and insight (才識), Am. religion (信仰), An. comic/clown type (丑類)

B. THING (物): Ba. generally called (統稱), Bb. (擬狀物), Bc. part of an object (物體的部分), Bd. celestial body (天體), Be. terrain features (地貌), Bf. meteorological phenomena (氣象), Bg. natural substance (自然物), Bh. plant (植物), Bi. animal (動物), Bj. micro-organism (微生物), Bk. the whole body (全身), Bl. secretions/excretions (排泄物/分泌物), Bm. material (材料), Bn. building (建築物), Bo. machines and tools (機具), Bp. appliances (用品), Bq. clothing (衣物), Br. edibles/medicines/drugs (食品/藥物/毒品)

C. TIME AND SPACE (時間與空間): Ca. time (時間), Cb. space (空間)

D. ABSTRACT THINGS (抽象事物): Da. event/circumstances (事情/情況), Db. reason/logic (事理), Dc. looks (外貌), Dd. functions/properties (性能), De. character/ability (性格/才能), Df. consciousness (意識), Dg. analogical thing (比喻物), Dh. imaginary things (臆想物), Di. society/politics (社會/政法), Dj. economy (經濟), Dk. culture and education (文教), Dl. disease (疾病), Dm. organization (機構), Dn. quantity/unit (數量/單位)

E. CHARACTERISTICS (特徵): Ea. external form (外形), Eb. surface looks/seeming (表象), Ec. color/taste (顏色/味道), Ed. property (性質), Ee. virtue and ability (德才), Ef. circumstances (境況)

F. MOTION (動作): Fa. motion of the upper limbs (hands) (上肢動作), Fb. motion of the lower limbs (legs) (下肢動作), Fc. motion of the head (頭部動作), Fd. motion of the whole body (全身動作)

G. PSYCHOLOGICAL ACTIVITY (心理活動): Ga. state of mind (心理狀態), Gb. activity of mind (心理活動), Gc. capability and willingness (能/願)

H. ACTIVITY (活動): Ha. political activity (政治活動), Hb. military activity (軍事活動), Hc. administrative management (行政管理), Hd. production (生產), He. economic activity (經濟活動), Hf. communications and transportation (交通運輸), Hg. education, hygiene, and scientific research (教衛科研), Hh. recreational and sports activities (文體活動), Hi. social contact (社交), Hj. life (生活), Hk. religious activity (宗教活動), Hl. superstitious activity (迷信活動), Hm. public security and judicature (公安/司法), Hn. wicked behavior (惡行)

I. PHENOMENON AND CONDITION (現象與狀態): Ia. natural phenomena (自然現象), Ib. physiological phenomena (生理現象), Ic. facial expression (表情), Id. object status (物體狀態), Ie. situation (事態), If. circumstances (mostly unlucky) (境遇), Ig. the beginning and the end (始末), Ih. change (變化)

J. TO BE RELATED (關聯): Ja. association (聯繫), Jb. similarities and dissimilarities (異同), Jc. to operate in coordination (配合), Jd. existence (存在), Je. influence (影響)

K. AUXILIARY PHRASE (助語): Ka. quantitative modifier (疏狀), Kb. preposition (中介), Kc. conjunction (聯接), Kd. auxiliary (輔助), Ke. interjection (呼嘆), Kf. onomatopoeia (擬聲)

L. GREETINGS (敬語)


Hsin-Hsi Chen (NTU) 78

Degree of Polysemy in Mandarin Chinese

• Small categories of Cilin are used to compute the distribution of word senses.

• The ASBC corpus is employed to count word frequencies.

• In total, 28,321 word types appear both in Cilin and in the ASBC corpus.

• In total, 5,922 words are polysemous.


Hsin-Hsi Chen (NTU) 79

Table 1. The Distribution of Word Senses

Low Ambiguity          Middle Ambiguity     High Ambiguity
Degree  #Word Types    Degree  #Word Types  Degree  #Word Types   Degree  #Word Types
2       4261 (71.95%)  5       186 (3.14%)  9       14 (0.24%)    14      1 (0.02%)
3       948 (16.01%)   6       77 (1.30%)   10      8 (0.14%)     15      1 (0.02%)
4       344 (5.81%)    7       42 (0.71%)   11      3 (0.05%)     17      1 (0.02%)
                       8       25 (0.42%)   12      4 (0.07%)     18      1 (0.02%)
                                            13      5 (0.08%)     20      1 (0.02%)
Sum     5553 (93.77%)  Sum     330 (5.57%)  Sum     39 (0.66%)

Total word types: 5,922
degree: number of senses of a word; word type: a dictionary entry


Table 2. The Distribution of Word Senses with Consideration of POS

Degree   N              V              A             F            K
2        1441 (81.05%)  1056 (71.79%)  580 (79.67%)  14 (77.78%)  101 (73.72%)
3        238 (13.39%)   238 (16.18%)   115 (15.80%)  4 (22.22%)   25 (18.25%)
4        55 (3.09%)     99 (6.73%)     20 (2.75%)                 7 (5.11%)
5        26 (1.46%)     41 (2.79%)     9 (1.24%)                  3 (2.19%)
6        12 (0.67%)     13 (0.88%)     2 (0.27%)                  1 (0.73%)
7        3 (0.17%)      13 (0.88%)     2 (0.27%)
8        2 (0.11%)      6 (0.40%)
9        1 (0.06%)      1 (0.07%)
11                      1 (0.07%)
12                      1 (0.07%)
13                      1 (0.07%)
19                      1 (0.07%)
Total    1778           1471           728           18           137

Degrees 2-4 count as low, 5-8 as middle, and >8 as high ambiguity. Total word types: 4,132 of the 5,922 polysemous words. Low ambiguity covers 97.53% of N and 94.70% of V, and 98.22% of A and 97.08% of K.


Hsin-Hsi Chen (NTU) 81

Table 3. The Distribution of Word Senses with Consideration of Frequencies

                 Low Ambiguity       Middle Ambiguity   High Ambiguity
Types            5553 (93.77%)       330 (5.57%)        39 (0.66%)
Tokens           1143686 (58.52%)    635796 (32.53%)    174731 (8.94%)
#Tokens/#Types   205.96              1926.65            4480.28

word token: an occurrence of a word type in the ASBC corpus; word type: a dictionary entry

93.77% of polysemous words belong to the class of low ambiguity, yet they occupy only 58.52% of the tokens in the ASBC corpus.


Hsin-Hsi Chen (NTU) 82

Table 4. The Distribution of Word Senses and Frequencies with Consideration of POS

                             Frequency
Ambiguity                    Low       Middle    High      Sum       Percentage
Low (2-4)     Types (C)      3112      734       147       3993      96.64%
              Tokens (A)     70131     230955    735819    1036905   85.52%
              A/C            22.54     314.65    5005.57   259.68
Middle (5-8)  Types (C)      42        62        29        133       3.22%
              Tokens (A)     1905      14667     153307    169879    14.01%
              A/C            45.36     236.56    5286.45   1277.29
High (>8)     Types (C)      0         2         4         6         0.15%
              Tokens (A)     0         843       4847      5690      0.47%
              A/C            0         421.5     1211.75   948.33
Sum           Types (C)      3154      798       180       4132
              Tokens (A)     72036     246465    893973    1212474
              A/C            22.84     308.85    4966.52
%             Types (C)      76.33%    19.31%    4.36%
              Tokens (A)     5.94%     20.33%    73.73%

Low frequency (<100), middle frequency (100 to <1000), high frequency (>=1000).


Hsin-Hsi Chen (NTU) 83

Phenomena

• POS information reduces the degree of ambiguity
– In total, 8.94% of word tokens are highly ambiguous in Table 3; this decreases to 0.47% in Table 4.

• Highly ambiguous words tend to be highly frequent
– 23.67% of word types are middle- or high-frequency words, and they occupy 94.06% of word tokens.


Hsin-Hsi Chen (NTU) 84

Semantic Tagging: Unambiguous Words

• Acquire the context for each semantic tag, starting from the unambiguous words.

(Figure) Cilin + ASBC → unambiguous words, i.e., the words that have only one sense in Cilin.


Hsin-Hsi Chen (NTU) 85

Acquire Contextual Vectors

• An unambiguous word is characterized by the surrounding words.
• The window size is set to 6, and stop words are removed.
• A sense tag Ctag is represented as a vector (w1, w2, ..., wn).
– MI metric:

  MI(Ctag, cw) = log2 [ P(Ctag, cw) / (P(Ctag) P(cw)) ] = log2 [ f(Ctag, cw) × N / (f(Ctag) × f(cw)) ]

– EM metric:

  em(Ctag, cw) = max( [f(Ctag, cw) - En(Ctag, cw)] / sqrt( f(Ctag) × f(cw) ), 0 ),
  where En(Ctag, cw) = f(Ctag) × f(cw) / N is the expected co-occurrence count.

(f(·) denotes a frequency in the corpus, cw is a context word, and N is the corpus size.)
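As a small illustration of the MI metric above (a sketch with made-up counts; the trained system computes these from ASBC co-occurrence counts within the 6-word window):

```python
import math

def mi(f_tag, f_cw, f_joint, n):
    """MI(Ctag, cw) = log2( P(Ctag, cw) / (P(Ctag) P(cw)) )
                    = log2( f(Ctag, cw) * N / (f(Ctag) * f(cw)) ).

    f_tag, f_cw: marginal counts; f_joint: co-occurrence count; n: corpus size.
    """
    return math.log2(f_joint * n / (f_tag * f_cw))
```

A positive value means the tag and the context word co-occur more often than chance; the weights w1, ..., wn of a sense vector are these association scores.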


Hsin-Hsi Chen (NTU) 86

Semantic Tagging: Ambiguous Words

• Apply the information trained at the first stage to select the best sense tag from the candidates of each ambiguous word.

(Figure) Ambiguous words are those words that have more than one sense in Cilin.


Hsin-Hsi Chen (NTU) 87

Apply and Retrain Contextual Vectors

• Identify the context vector of an ambiguous word.

• Measure the similarity between a sense vector and a context vector by cosine formula.

• Select the sense tag of the highest similarity score.

• Retrain the sense vector for each sense tag.
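The select-by-cosine step can be sketched as follows, with sense and context vectors represented as sparse dicts (a representation assumed here for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_sense(context_vec, sense_vecs):
    """Pick the sense tag whose trained vector is closest to the context."""
    return max(sense_vecs, key=lambda tag: cosine(context_vec, sense_vecs[tag]))
```

Retraining then folds the newly tagged occurrences back into each sense vector and repeats.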


Hsin-Hsi Chen (NTU) 88

Semantic Tagging: Unknown Words

• Adopt outside evidence from the mapping between WordNet synsets and Cilin sense tags to narrow down the candidate set.


Hsin-Hsi Chen (NTU) 89

Candidate List

(Figure) An unknown Chinese word Cw is mapped by a Chinese-English dictionary (C-EDict) to English translations ew1, ew2, ew3; their WordNet synsets (syn11, syn12, syn21, syn22, syn23, syn31) are mapped through the mapping table between WordNet synsets and Cilin tags to candidate Cilin tags Ctag1, Ctag2, Ctag3; cosine similarity cos(vw, v) selects among the candidates.


Hsin-Hsi Chen (NTU) 90

Experiments

• Test materials
– Sample documents of different categories from the ASBC corpus
– In total, 35,921 words in the test corpus
– Research associates tagged this corpus manually:
• mark up the ambiguous words by looking up the Cilin dictionary
• tag the unknown words by looking up the mapping table

• The tag mapper achieves approximately 82.52% performance.


Hsin-Hsi Chen (NTU) 91

Performance of Tagging Ambiguous Words using MI

Ambiguity        Low      Middle   High     Summary
Total Tokens     6601     3511     989      11101
Correct Tokens   4132     1101     267      5500
Correct Rate     62.60%   31.36%   27.00%   49.55%


Hsin-Hsi Chen (NTU) 92

Performance of Tagging Ambiguous Words using EM

Ambiguity        Low      Middle   High     Summary
Total Tokens     6601     3511     989      11101
Correct Tokens   4223     1334     310      5867
Correct Rate     63.98%   37.99%   31.34%   52.85%

(For comparison, MI: 62.60%, 31.36%, 27.00%, 49.55%.)


Hsin-Hsi Chen (NTU) 93

Table 8. Performance of Tagging Using First-n and Middle Categories

                     Ambiguity
First-n  Categories  Low      Middle   High     Middle and High
1        Small       63.98%   37.99%   31.34%   36.53%
1        Middle      71.02%   56.19%   43.78%   53.47%
2        Small                60.92%   53.99%   59.40%
2        Middle               73.88%   65.72%   72.09%
3        Small                71.35%   67.95%   70.60%
3        Middle               79.27%   75.94%   78.53%

● The performance for tagging low-ambiguity (2-4), middle-ambiguity (5-8), and high-ambiguity (>8) words is similar (63.98%, 60.92%, and 67.95%) when 1, 2, and 3 candidates are proposed, respectively.
● Under the middle categories with 1-3 proposed candidates, the performance for tagging low-, middle-, and high-ambiguity words is 71.02%, 73.88%, and 75.94%.


Table 9. Performance of Tagging Unknown Words

Category  #Tokens  Measure    Baseline  M1      M2      P1      P2      M1(POS)
All       1633     Correct    20        443     395     438     396     561
                   Precision  1.22%     27.13%  24.19%  26.82%  24.25%  34.35%
N         858      Correct    11        255     228     255     231     320
                   Precision  1.28%     29.72%  26.57%  29.72%  26.92%  37.30%
V         619      Correct    5         144     124     137     120     167
                   Precision  0.81%     23.26%  20.03%  22.13%  19.39%  26.98%
A         58       Correct    0         5       5       5       5       28
                   Precision  0         8.62%   8.62%   8.62%   8.62%   48.28%
F         4        Correct    1         1       1       1       1       4
                   Precision  25.00%    25.00%  25.00%  25.00%  25.00%  100.00%
K         94       Correct    3         38      37      40      39      42
                   Precision  3.19%     40.43%  39.36%  42.55%  41.49%  44.68%

M1, M2: training from unambiguous words; P1, P2: training from unambiguous and ambiguous words. M1 and P1 use the more restrictive mapping table; M2 and P2 use the less restrictive mapping table.


Hsin-Hsi Chen (NTU) 95

Co-Reference Resolution


Hsin-Hsi Chen (NTU) 96

Introduction

• Anaphora vs. Co-Reference
– Anaphora (Example 1): 張三是老師，他1教學很認真，同時，他2也是一個好爸爸。
– Co-reference (同指涉) includes
• Type/Instance: "老師"/"張三", "一個好爸爸"/"張三"
• Function/Value: "現在的氣溫"/"攝氏 30 度" (Example 2: 現在的氣溫是攝氏 30 度。)
• Co-reference between NPs: "一隻小花貓"/"那隻貓"


Hsin-Hsi Chen (NTU) 97

Flow of Co-Reference Resolution

(Figure) Document → segmentation → named entity extraction → part-of-speech tagging → find all possible candidates → find attributes of candidates → resolve co-reference → output. Resources: AS Balanced Corpus, finite state transition networks, NP chunker, Cilin.


Hsin-Hsi Chen (NTU) 98

Co-Reference Resolution Algorithm

(Figure) Document → find the candidate list → determine candidates → all the candidates are partitioned into co-reference classes (Class 1, Class 2, ..., Class N) and singletons.


Hsin-Hsi Chen (NTU) 99

Find the Candidates

• Select all nouns (Cand-Terms)
– Na (common nouns)
– Nb (proper nouns)
– Nc (place nouns)
– Nd (time nouns)
– Nh (pronouns)
– Delete some Nds (171 in total), e.g., "一會兒", "稍後", "剎那間", "瞬間", "幾時", ...

• Select noun phrases (Cand-NP)

• Select maximal noun phrases (Cand-MaxNP)

• Some are found during named entity extraction.


Hsin-Hsi Chen (NTU) 100

Recognize NPs whose head is Na (common noun)

(Figure: a finite state transition network over the categories Init, adj, Neu, Neqa, Nes, Nep, Ncd, Nf, DE, Nh, Na.) Examples:

其他 (Neqa) 的 (DE) 廠商 (Na)
其他 (Neqa) 三 (Neu) 家 (Nf) 廠商 (Na)
這 (Nep) 三 (Neu) 家 (Nf) 廠商 (Na)
該 (Nes) 廠商 (Na)


Hsin-Hsi Chen (NTU) 101

Recognize NPs whose head is Nh (pronoun)

(Figure: a finite state transition network over Init, Na, Nb, Nh.) Examples:

塞南 (Nb) 本人 (Nh)
太空船 (Na) 本身 (Nd)
他 (Nh) 本人 (Nh)


Hsin-Hsi Chen (NTU) 102

Cilin (同義詞詞林 )

• 12 large categories

• 94 middle categories

• 1,428 small categories

• 3,925 word clusters


Hsin-Hsi Chen (NTU) 103

Features

• Classification
– word/phrase itself
– part of speech of the head
– semantics of the head
– type of named entity
– positions (sentences and paragraphs)
– number: singular, plural, unknown
– gender: pronouns and Chinese person names
– pronouns: personal pronouns, demonstrative pronouns


Hsin-Hsi Chen (NTU) 104

Co-Reference Resolution Algorithms

• Strategy 1: simple pattern matching

• Strategy 2: Cardie Clustering Algorithm


Hsin-Hsi Chen (NTU) 105

Cardie Clustering Algorithm

Algorithm: Coreference_Clustering(NPn, NPn-1, ..., NP1)
1. Let R be the clustering radius.
2. Initially, each candidate forms its own cluster: ci = {NPi}.
3. for j = n downto 1
     for i = j-1 downto 1
       (a) let d = dist(NPi, NPj)
       (b) let ci (cj) be the class to which NPi (NPj) belongs
       (c) if d < R and All_NPs_Compatible(ci, cj) is true, then merge the two classes

Algorithm: All_NPs_Compatible(ci, cj)
1. For each NPa in ci and NPb in cj, compute dist(NPa, NPb).
2. If there exists a dist(NPa, NPb) = ∞, return false.
3. Return true.

dist(NPi, NPj) = Σ_{f ∈ F} w_f × incompatibility_f(NPi, NPj)
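A compact rendering of the clustering loop (simplified: it takes a distance function directly and returns index sets, whereas the full algorithm computes the weighted incompatibility distance over the feature table):

```python
def coref_cluster(nps, dist, R=1.0):
    """Greedy clustering: a pair within radius R triggers a merge, but only
    if no cross-pair between the two clusters is incompatible (infinite
    distance), mirroring All_NPs_Compatible."""
    clusters = [{i} for i in range(len(nps))]

    def find(i):
        return next(c for c in clusters if i in c)

    for j in range(len(nps) - 1, 0, -1):      # last NP to first, as above
        for i in range(j - 1, -1, -1):
            ci, cj = find(i), find(j)
            if ci is cj:
                continue
            if dist(nps[i], nps[j]) < R and all(
                    dist(nps[a], nps[b]) != float("inf")
                    for a in ci for b in cj):
                clusters.remove(ci)
                clusters.remove(cj)
                clusters.append(ci | cj)
    return clusters
```

With a toy distance (0 for identical strings, infinity otherwise), the two occurrences of 他 cluster together while 張三 stays a singleton; the full distance function would link 張三 and 他 through the pronoun and semantics features.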


feature f     w_f   incompatibility function (assume NPi before NPj)
string        10    total differing characters ÷ length of the maximal string; if the head is a pronoun, 0 is assigned
head          1     if the heads are different, 1 is assigned; else 0
position      5     relative position of the two terms in sentences ÷ total sentences + relative position in paragraphs ÷ total paragraphs
pronoun       R     if the head of NPi is a pronoun but the head of NPj is not, 1 is assigned; else 0
substring     -∞    if neither head is a pronoun and the head of NPj is a substring of the head of NPi, 1 is assigned; else 0
head is NE    -∞    if the heads denote the same named entity, 1 is assigned; else 0
synonym       -∞    if the heads belong to the same word cluster, 1 is assigned; else 0
NP modifier   ∞     if NPi is a modifier of NPj, 1 is assigned; else 0
proper name   ∞     if both terms are proper names of different types, 1 is assigned; if they are of the same type but share no common character, 1 is assigned; else 0
gender        ∞     if the genders differ, 1 is assigned; else 0
number        ∞     if the numbers differ, 1 is assigned; else 0
semantics     ∞     s, where s = SemanticFun_1(NPi, NPj) or s = SemanticFun_2(NPi, NPj)


Hsin-Hsi Chen (NTU) 107

Semantics Restrictions

• SemanticFun_1(NPi, NPj)
– If the heads belong to the same word cluster, 0 is assigned; else 1.

• SemanticFun_2(NPi, NPj)
– Integrates POS, named entity type, and Cilin sense. The cases are:
• only one of NPi and NPj is a named entity
• neither is a named entity, and both are in Cilin
• neither is a named entity, and only one of them is in Cilin
• neither is a named entity, and both are outside Cilin → 0


Hsin-Hsi Chen (NTU) 108

SemanticFun_2
– One of them is an NE
• Columns denote the type of NE
• Rows denote the part of speech
• An English string in a table cell denotes compatible Cilin senses

                     person  organization  location        time  date  money   percentage
Na                   A       Di, Dm, Bn    Bd, Be, Bn, Cb  Ca    Ca    Dj, Dn  Dn
Nb                   A       1             Bd, Be, Bn, Cb  1     1     1       1
Nc                   1       Di, Dm, Bn    Bd, Be, Bn, Cb  1     1     1       1
Nd                   1       1             1               Ca    Ca    1       1
Nh (personal)        0       1             1               1     1     1       1
Nh (generic)         0       0             0               0     0     0       0
Nh (thing-denoting)  1       0             0               1     1     1       1


Experimental Results

Run  Configuration                               Recall             Precision           F
18   All cues (R=1.0)       3253  4441  5486/9745 (56.3)  5486/14684 (37.4)  44.91
7    Pattern matching       3238  5283  5466/9743 (56.1)  5466/14696 (37.2)  44.73
30   Delete sub-string      3224  3651  4191/9693 (43.2)  4191/10985 (38.2)  40.54
31   Delete head=NE         3253  4441  5486/9745 (56.3)  5486/14684 (37.4)  44.91
32   Delete synonym         3252  4372  5414/9738 (55.6)  5414/14162 (38.2)  45.31
33   Delete NP modifier     3253  4434  5493/9745 (56.4)  5493/14705 (37.4)  44.93
34   Delete proper nouns    3254  4424  5490/9746 (56.3)  5490/14840 (37.0)  44.66
35   Delete gender          3253  4441  5486/9745 (56.3)  5486/14695 (37.3)  44.89
36   Delete number          3253  4436  5496/9746 (56.4)  5496/14784 (37.2)  44.81
37   Delete semantics       3255  4625  5679/9752 (58.2)  5679/16198 (35.1)  43.77
39   Delete synonym + NP modifier  3249  4134  5783/9746 (59.3)  5783/13353 (43.3)  50.07


Hsin-Hsi Chen (NTU) 110

Named Entity Tagging Environment


Hsin-Hsi Chen (NTU) 111


Hsin-Hsi Chen (NTU) 112


Hsin-Hsi Chen (NTU) 113


Hsin-Hsi Chen (NTU) 114

tag at the same time


Hsin-Hsi Chen (NTU) 115


Hsin-Hsi Chen (NTU) 116

勝利 vs. 勝利工業公司


Hsin-Hsi Chen (NTU) 117


Hsin-Hsi Chen (NTU) 118

Named Entity Extraction in Bioinformatics Application

• Named Entities– Protein Name– Gene Name– …


Hsin-Hsi Chen (NTU) 119

Summary

• Segmentation

• Named Entity Extraction

• POS and Sense Tagging

• Co-Reference Resolution

• NE Tagging Environment

• Bioinformatics