Language Archiving- Document Annotation and Corpus Linguistics

Keh-Jiann Chen

Institute of Information Science

Academia Sinica

The goals of NDAP are (quoted from [Hsieh 2002, "Digital Media, Informatics, and Cultural Heritage"]):

Preserving national cultural collections.

Popularizing fine cultural holdings.

Strengthening cultural heritage as well as guiding cultural development.

Popularizing knowledge and improving information sharing.

Enhancing education and learning.

Bootstrapping cultural and value-added industries.

Improving literacy, creativity and quality of life.

Promoting international cooperation and resource sharing.

Space, Time and Language Coordinates for Digital Archives

[Figure: digital archives located on three coordinates. Labels: Language; Time; Space; Language in Time; Language in Space; Language in Text, in Speech...; Historical GIS; Language changes; Language variations; Digital Archives.]

Digital Archives and TSL coordinates (quoted from [Hsieh 2002, "Digital Media, Informatics, and Cultural Heritage"])

Language Archiving is a Collection of Linguistic Resources

Collection of a linguistic archive (such as a balanced corpus) is guided by a set of design criteria.

Design criteria define natural classes of texts in a collection.

Each criterion establishes a dimension for comparative studies.

www.sinica.edu.tw/SinicaCorpus

How to make a single archive more versatile

One Corpus or Many Corpora?

Or How to make a Balanced Corpus Biased?

With textual markup information (e.g. metadata):

genre, style, mode, topic, medium etc.

word, part-of-speech, structure tags, semantic tags

Alignment for heterogeneous corpora

Creating Synergy from Uniform Resource Type

Each document is marked up with textual description features: topic, style etc.

Each feature selects a subset of documents.

Sub-corpora (or new archives) can be created online according to the user's specification.
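As a rough illustration of feature-based selection, the sketch below filters a toy document list by feature values. The `Document` class, the `select_subcorpus` function and the three sample documents are hypothetical and are not part of the Sinica Corpus software.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A text with its markup features (hypothetical structure)."""
    text_id: str
    features: dict  # e.g. {"genre": "reportage", "topic": "sports"}
    text: str = ""


def select_subcorpus(corpus, **criteria):
    """Return the documents whose features match every given criterion."""
    return [
        doc for doc in corpus
        if all(doc.features.get(name) == value for name, value in criteria.items())
    ]


# Toy corpus, invented for illustration only.
corpus = [
    Document("d1", {"genre": "reportage", "mode": "written", "topic": "sports"}),
    Document("d2", {"genre": "poetry", "mode": "written", "topic": "literature"}),
    Document("d3", {"genre": "conversation", "mode": "spoken", "topic": "sports"}),
]

sports = select_subcorpus(corpus, topic="sports")
spoken_sports = select_subcorpus(corpus, topic="sports", mode="spoken")
print([d.text_id for d in sports])         # ['d1', 'd3']
print([d.text_id for d in spoken_sports])  # ['d3']
```

Each added criterion narrows the selection, so one marked-up archive can serve as many sub-corpora.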

Creating Synergy from Uniform Resource Type

Classical Chinese Corpora: http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html

Corpus of Formosan Austronesian Languages: under construction, part of the National Digital Archive Initiative

Lexical Databases of other Sino-Tibetan and Tibeto-Burmese Languages

Creating Synergy from Heterogeneous Resource Type

Bi-lingual or multi-lingual corpora

Text and speech aligned corpora

Synchronized corpora collected from different areas

How to create a balanced corpus?

Creation of the Sinica Corpus: a word-segmented modern Chinese corpus with POS tagging

Introduction

TEI: A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage.

It must be representative and balanced if it claims to faithfully represent the facts in that language or sublanguage [Sinclair 87].

Introduction: Sinica balanced corpus

Texts are classified according to 5 different features: (1) Genre, (2) Style, (3) Mode, (4) Topic, (5) Medium.

Word segmentation standard: Segmentation standard for Chinese language processing, http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm

Part-of-speech tagging: 46 syntactic categories

Genre (written): reportage, commentary, advertisement, letter, announcement, fiction, prose, biography & diary, poetry, analects, manual

Genre (spoken): script, conversation, speech, meeting minutes

Style: narration, argumentation, exposition, description

Mode: written, written-to-be-read, written-to-be-spoken, spoken, spoken-to-be-written

Topic: philosophy, natural sciences, social sciences, fine arts, general/leisure, literature

Medium: newspaper, general magazine, academic journal, textbook, reference book, thesis, general book, audio/visual media, interactive speech

Sinica Corpus topic distribution: philosophy 10%, natural sciences 10%, social sciences 35%, fine arts 5%, general/leisure 20%, literature 20%

%% 文類 Genre= 報導 reportage

%% 文體 Style= 記敘 Narration

%% 語式 Mode= written

%% 主題 Topic= 訊息 Message

%% 媒體 Medium= 報紙 Newspaper

%% 姓名 Author’s name=

%% 性別 Gender= 男女

%% 國籍 Nationality= 中華民國 Chinese

%% 母語 Mother tongue= 中文 Chinese

%% 出版單位 Publisher= 中研院週報 Academia Sinica

%% 出版地 Place= 台北市台灣 Taipei Taiwan

%% 出版日期 date=1994

%% 版次 version=

%% 標題 Title= 國史研習會：中國宗教與社會

1. 　。 (PERIODCATEGORY) 　由 (P) 　本 (Nes) 　院 (Nc) 　歷史 (Na) 　語言 (Na) 　研究所 (Nc) 　主辦 (VC) 　， (COMMACATEGORY)

***********************************************

2. 　， (COMMACATEGORY) 　台灣 (Nc) 　大學 (Nc) 　歷史系 (Nc) 　暨 (Caa) 　研究所 (Nc) 　與 (Caa) 　清華 (Nb) 　大學 (Nc) 　歷史系 (Nc) 　暨 (Caa) 　研究所 (Nc) 　協辦 (VC) 　之 (DE) 　「 (PARENTHESISCATEGORY) 　國史(Na) 　研習會 (Na) 　： (COLONCATEGORY)

***********************************************

3. 　： (COLONCATEGORY) 　中國 (Nc) 　宗教 (Na) 　與 (Caa) 　社會 (Na) 　」 (PARENTHESISCATEGORY) 　，(COMMACATEGORY)

***********************************************
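The sample above has two line types: `%%` header lines carrying a feature name and value, and tagged lines of the form `word (POS)`. A minimal parsing sketch follows; the function names and regular expressions are assumptions made for illustration and do not describe the corpus's own tools. Bracketed feature suffixes such as `[+nom]` are not handled.

```python
import re

# "%% <Chinese name> <English name>= <value>" (layout assumed from the sample above)
HEADER = re.compile(r"^%%\s*(\S+)\s+(.+?)\s*=\s*(.*)$")
# "word (TAG)"; \s also matches the ideographic space used between tokens
TOKEN = re.compile(r"(\S+?)\s*\(([^()\s]+)\)")


def parse_header(line):
    """Return (english_name, value) for a header line, or None otherwise."""
    m = HEADER.match(line)
    if m is None:
        return None
    _chinese_name, english_name, value = m.groups()
    return english_name, value.strip()


def parse_tagged(line):
    """Return the (word, pos) pairs found in one tagged line."""
    return TOKEN.findall(line)


print(parse_header("%% 文類 Genre= 報導 reportage"))
print(parse_tagged("由 (P) 本 (Nes) 院 (Nc) 歷史 (Na) 語言 (Na) 研究所 (Nc) 主辦 (VC)"))
```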

Design of Corpus Construction and Management System

Introduction

Motivations for designing a corpus management system:

It is hard to collect, maintain, classify, and tag a large amount of text without using a management system.

Automate the word segmentation and tagging processes.

Maintain the precision and consistency of data collection.

Handle the out-of-vocabulary words.

Database for Texts

[Figure: the construction system delivers tagged texts to the text database. Each record (record 1, record 2, ...) has three fields: text id, features, and text.]

Construction Flow

[Figure: flow chart. Text Collection Module (source: WWW) -> text files -> Unknown Word Identification Module -> text and new words -> Word Segmentation and POS-tagging Module -> tagged text -> Inspection System (New Word Editor, Tagged Text Editor) -> revised tagged text -> Text Database (SQL); revised new words -> domain lexicons.]

Text Collection Module

Purpose: Semi-automatically collect various texts from the WWW.

Features: Automatic feature extraction and document classification.

Unknown Word Identification Module

Identify new words before word segmentation.

Methods:

Detect the existence of unknown words.

Apply statistical rules and morphological rules to identify unknown words.

Word Segmentation & Tagging Module

Based on the word segmentation standard for information processing, the segmentation program segments the input text and tags the resulting words with their parts of speech.

Methods: word matching based on the lexicon and newly identified words.

Segmentation process: longest matching and heuristic rules to resolve segmentation ambiguities.

POS tagging: bi-gram model for resolving POS ambiguities.
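The longest-matching step can be sketched as a greedy left-to-right scan that always takes the longest lexicon entry starting at the current position. This is a simplified illustration only: the heuristic disambiguation rules mentioned above are omitted, the tiny lexicon is invented, and `longest_match_segment` is a hypothetical name.

```python
def longest_match_segment(text, lexicon):
    """Greedy forward longest matching; unmatched characters become single-character words."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


general_lexicon = {"台大", "本", "學期", "舉辦"}   # toy lexicon, for illustration
sentence = "台大本學期舉辦減重班"

print(longest_match_segment(sentence, general_lexicon))
# ['台大', '本', '學期', '舉辦', '減', '重', '班']

# Adding a newly identified word to the lexicon changes the result.
print(longest_match_segment(sentence, general_lexicon | {"減重班"}))
# ['台大', '本', '學期', '舉辦', '減重班']
```

The second call shows why unknown words are identified before segmentation: without the entry 減重班, the word falls apart into single characters.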

Word Segmentation & Tagging Module (cont.)

Additional features: Incorporate a user-defined dictionary or domain dictionary to enhance word segmentation accuracy.

Domain dictionary: e.g. a medical dictionary or a dictionary of computing terminology.

Extracted unknown words: New words, such as personal names, always occur in text. The unknown word identification process extracts them, and they supplement the dictionary.

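[Figure: word segmentation and tagging takes a text as input and produces tagged text, drawing on the general lexicon, the domain lexicon, and the unknown words extracted from the text.]

Text: 台大本學期舉辦減重班 (National Taiwan University runs a weight-loss class this semester)

Tagged text: 台大 (Nc) 　本 (Nes) 　學期 (Na) 　舉辦 (VC) 　減重班 (Na)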

Inspection System

Purpose: To assure the quality of the corpus collection, the automatically processed texts need to be verified by human experts. An inspection system was therefore designed to speed up the verification process.

Major functions:

Editing functions: Errors in word breaks, POS tags, features, and sentence breaks can be fixed with a mouse click.

Reminder functions: The system highlights common errors, prefixes, and suffixes in the text.

Short-term memory: The system recalls the most recent modifications and fixes the same type of errors automatically.

Inspection System (cont.)

Provides lexical information and examples.

Friendly user interface.

[Figure labels: corpus under construction; user; Web server; SQL server; dictionary; previous version of the corpus.]

J 塑膠 (Na) 　皮 (Na)→ 塑膠皮 (Na)

J 公文 (Na) 　包 (VC)→ 公文包 (Na)

J 村 (Nc) 　上 (Ncd)→ 村上 (Nb)

J 毛利 (Na) 　遜 (VH)→ 毛利遜 (Nb)

J 吉姆 (Nb) 　毛利遜 (Nb)→ 吉姆毛利遜 (Nb)

D 世界級 (Na)→ 世界 (Nc) 　級 (Na)

D 科學方法 (Na)→ 科學 (Na) 　方法 (Na)

D 三代 (Nd)→ 三 (Neu) 　代 (Na)

D 交互作用 (Na)→ 交互 (VH) 　作用 (Na)

D 如一 (VH)→ 如 (P) 　一 (Neu)

C 改變 (VC)→ 改變 (Na)

C 傳統 (VH)→ 傳統 (Na)

C 企畫 (VC)→ 企畫 (Na)

C 自然 (D)→ 自然 (VH)

C 起來 (VA)→ 起來 (Di)

F 反射 (VJ)→ 反射 (VJ)[+nom]

F 遮雨 (VA)→ 遮雨 (VA)[+nom]

F 保持 (VJ)[+nom]→ 保持 (VJ)

F 萊特班 (Na)→ 萊特班 (Na)[+prop]

F 感動 (VHC)→ 感動 (VHC)[+nom]

Corpus Management System

Advantages:

The corpus management system speeds up the construction process and reduces human effort.

It also increases the precision and consistency of word segmentation and POS tagging.

The database system facilitates searching, managing, retrieving, and reorganizing texts.

Using Corpora

Reorganizing sub-corpora

Searching tools

Reorganizing sub-corpora

Sub-corpora can be reorganized according to different features:

Sport corpus

Spoken corpus

Corpus of the most recent three months

News corpus

Corpus of poetry

Corpus Searching Tools

[Figure: flow chart. Key word vectors -> Key Word in Context (KWIC) search -> KWIC file -> filtering and sorting -> display, print, or store; statistics and collocation.]

Corpus Searching Tools: KWIC search

Key word vector: what is matched

[代表, N, φ, φ]: every word 代表 daibiao tagged with the POS noun

[φ, VA, φ, 1]: all monosyllabic intransitive verbs (VA)

[φ, φ, +fw, φ]: all foreign words

[..化, V, φ, 3]: all tri-syllabic verbs with the suffix 化 hua '-ize'
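One possible reading of the four-slot key word vector [word, POS, feature, length] is sketched below. Several points are assumptions rather than documented behaviour of the Sinica search tool: φ is represented as `None`, `.` in the word slot is treated as a one-character wildcard (via a regular expression), the POS slot is matched as a prefix so that `N` covers `Na`, `Nc` and so on, and length is counted in characters. The sample tokens and their tags are invented.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    features: frozenset = frozenset()


def matches(token, vector):
    """Check one token against a [word, pos, feature, length] key word vector."""
    word_pattern, pos_prefix, feature, length = vector
    if word_pattern is not None and re.fullmatch(word_pattern, token.word) is None:
        return False
    if pos_prefix is not None and not token.pos.startswith(pos_prefix):
        return False
    if feature is not None and feature not in token.features:
        return False
    if length is not None and len(token.word) != length:
        return False
    return True


def kwic(tokens, vector, window=2):
    """Return (left context, key word, right context) for every matching token."""
    lines = []
    for i, token in enumerate(tokens):
        if matches(token, vector):
            left = " ".join(t.word for t in tokens[max(0, i - window):i])
            right = " ".join(t.word for t in tokens[i + 1:i + 1 + window])
            lines.append((left, token.word, right))
    return lines


# Invented example sentence; the tags are illustrative only.
tokens = [
    Token("他", "Nh"), Token("代表", "VC"), Token("學校", "Nc"),
    Token("的", "DE"), Token("代表", "Na"), Token("來", "VA"),
    Token("談", "VE"), Token("現代化", "VHC"),
    Token("OK", "FW", frozenset({"+fw"})),
]

print(kwic(tokens, ("代表", "N", None, None)))   # only the noun reading of 代表
print(kwic(tokens, ("..化", "V", None, 3)))      # tri-syllabic verbs ending in 化
```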

Corpus Searching Tools: Filtering

The filtering methods include random sampling, removing redundant samples, and removing irrelevant samples by restricting the content in the window of key words.

Displaying, printing, and storing

The resulting KWIC files can be displayed on screen, printed, or stored for future processing.

Corpus Searching Tools: Statistics

Statistic functions provide statistical distributions of words and categories occurring within the context window of key words.

For instance, the category distribution of the word 把 ba:

Category           Tag     Frequency   %
preposition        P       2704        92.57
measure            Nf      211         7.22
transitive verb    Vc      3           0.10
determiner         Neqb    2           0.07
noun               Na      1           0.03
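The percentages in this distribution follow directly from the frequencies. A short sketch, assuming the tags of every occurrence of the key word have already been collected into a list (`category_distribution` is a hypothetical helper name):

```python
from collections import Counter


def category_distribution(tags):
    """Return (tag, frequency, percentage) rows, most frequent first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [(tag, n, round(100 * n / total, 2)) for tag, n in counts.most_common()]


# Rebuild the tag list from the frequencies reported for 把 ba.
tags = ["P"] * 2704 + ["Nf"] * 211 + ["Vc"] * 3 + ["Neqb"] * 2 + ["Na"] * 1

for tag, freq, pct in category_distribution(tags):
    print(f"{tag:<5}{freq:>6}{pct:>8.2f}")
```

Running it reproduces the percentage column above (92.57, 7.22, 0.10, 0.07, 0.03) from a total of 2921 occurrences.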

Corpus Searching Tools: Collocation finding

The system finds collocations of the key words by computing the mutual information [Church & Hanks 90] of the key words with the words or parts of speech in a user-defined window.

Mutual information: I(x,y) = log [ P(x,y) / (P(x) * P(y)) ]

I(x,y) >> 0: x and y are strongly associated.

I(x,y) ≈ 0: x and y are unrelated.

I(x,y) << 0: x and y are mutually exclusive.
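A small sketch of the mutual information computation follows. It makes assumptions the slide does not state: probabilities are relative frequencies over the corpus size N, co-occurrence is counted symmetrically within the window on both sides of x, the logarithm is base 2, and a pair that never co-occurs is given negative infinity. The toy token sequence is invented.

```python
import math


def mutual_information(tokens, x, y, window=10):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts."""
    n = len(tokens)
    count_x = tokens.count(x)
    count_y = tokens.count(y)
    # Number of (x, y) pairs where y lies within `window` tokens of x.
    count_xy = sum(
        1
        for i, token in enumerate(tokens) if token == x
        for j in range(max(0, i - window), min(n, i + window + 1))
        if j != i and tokens[j] == y
    )
    if count_xy == 0:
        return float("-inf")  # never seen together in the window
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))


toy = ["a", "b", "c", "a", "b", "d", "e", "f"]
print(mutual_information(toy, "a", "b", window=1))  # 2.0
print(mutual_information(toy, "a", "f", window=1))  # -inf
```

In the first call, P(a,b) = 2/8 and P(a) = P(b) = 2/8, so the ratio is 4 and I(a,b) = 2. Ranking all candidate words by this score within the window gives a collocation list.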

Examples

The top 16 collocations of '威脅' (threat) within a window of distance 10:

1. 飽受 2. 恫嚇 3. 綑綁 4. 構成 5. 嚴重 6. 崩坍 7. 恐怖 8. 恐嚇 9. 遭受 10. 刀槍 11. 滾滾 12. 安全 13. 尖刀 14. 健康 15. 成全 16. 備受

Corpus Linguistics

A corpus provides ample examples of word uses and syntactic patterns. It also reflects the real uses of the language and their frequency distribution.

Comparative studies can be made within KWIC results or between sub-corpora.

Automatic knowledge extraction techniques can be applied to a corpus to reduce manual effort.

Lexicography

A corpus provides ample examples of different word uses and syntactic patterns.

A corpus reflects the real uses of the language and their frequency distribution.

Collocations show idiomatic patterns, and they are the most important uses of a word.

Examples can be extracted from corpora.

Senses and syntactic functions can be ordered according to their frequencies.

CoBuild, Oxford, EDR, and the Collocation Dictionary of Noun and Measure Words are examples of using corpora for editing dictionaries.

Language Modeling

Markov language model: the probabilities are estimated from corpora.

P(W1 W2 … Wm) = P(W1) * P(W2|W1) * P(W3|W1 W2) * … * P(Wm|W1 W2 … Wm-1)

N-gram model: P(W1 W2 … Wm) ≈ P(W1) * P(W2|W1) * P(W3|W1 W2) * … * P(Wm|Wm-n+1, …, Wm-1)
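For n = 2 the model reduces to a product of bi-gram probabilities, each estimated as a relative frequency in the corpus. The sketch below is a minimal, unsmoothed illustration: the sentence boundary markers `<s>` and `</s>` are an added assumption that the slide does not mention, the training sentences are invented, and unseen bi-grams simply get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bi-grams and their histories over sentences padded with <s> and </s>."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        padded = ["<s>"] + list(sentence) + ["</s>"]
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded, padded[1:]))
    return history_counts, bigram_counts


def sentence_probability(sentence, history_counts, bigram_counts):
    """P(W1 ... Wm) under the bi-gram model, without smoothing."""
    padded = ["<s>"] + list(sentence) + ["</s>"]
    probability = 1.0
    for previous, word in zip(padded, padded[1:]):
        if history_counts[previous] == 0:
            return 0.0  # history never seen in training
        probability *= bigram_counts[(previous, word)] / history_counts[previous]
    return probability


training = [
    ["I", "like", "tea"],
    ["I", "like", "coffee"],
    ["you", "like", "tea"],
]
histories, bigrams = train_bigram(training)
print(sentence_probability(["I", "like", "tea"], histories, bigrams))   # 2/3 * 1 * 2/3 * 1
print(sentence_probability(["you", "like", "milk"], histories, bigrams))  # 0.0, unseen word
```

A practical model would add smoothing so that unseen word pairs do not zero out the whole sentence.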

Language Modeling

Applications of language modeling:

Inputting methods: speech recognition, character recognition, spelling check, phonetic input, …

Data compression: Huffman coding, Arithmetic Coding,…

Categorization: Text classification, pos tagging, sense disambiguation, word segmentation,…

Machine Translation

IBM [Brown et al. 1990] used the bilingual Hansard corpus to build translation models.

Translating a French sentence F into an English sentence E is equivalent to finding the E that maximizes P(E|F), which by Bayes' rule is the E that maximizes P(E)*P(F|E).

P(E) is estimated from a bi-gram model.

P(F|E) is estimated from an aligned bilingual corpus.

Conclusion

A language archive is not only the most important cultural heritage but also the most important resource for language research.

Computer tools make archiving more efficient and manageable.

Everyone can access the archive through the WWW.

Websites: Corpora and Archives

Sinica Corpus (Academia Sinica Balanced Corpus of Modern Chinese)

www.sinica.edu.tw/SinicaCorpus

Academia Sinica Classical Chinese Corpora: Early Mandarin

www.sinica.edu.tw/Early_Mandarin

Academia Sinica Formosan Language Archive: Rukai(Mantauran)

www.ling.sinica.edu.tw/formosan

Websites: Digital Museums

Chinese Language KnowledgeNets

WenGuo: Adventures in Wen-Land

http://www.sinica.edu.tw/wen

SouWenJieZi

http://www.dmpo.sinica.edu.tw/~words

Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)

5 million words, segmented and tagged

Direct WWW access: http://www.sinica.edu.tw/ftms-bin/kiwi.sh

License information: http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm

Sinica Treebank 1.0

38,725 trees

239,532 words

Direct WWW access (1000 sample trees): http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm

License information: http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm

Mandarin-Across-Taiwan (MAT) Speech Database

Speech files are collected through telephone networks. The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).

MAT-160 (160 speakers)

MAT-2000

http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm
