
Page 1: Language Archiving- Document Annotation and Corpus Linguistics

Language Archiving- Document Annotation and Corpus Linguistics

Keh-Jiann Chen
Institute of Information Science

Academia Sinica

Page 2: Language Archiving- Document Annotation and Corpus Linguistics

The goals of NDAP are (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”]):

Preserving national cultural collections.
Popularizing fine cultural holdings.
Strengthening cultural heritage as well as guiding cultural development.
Popularizing knowledge and improving information sharing.
Enhancing education and learning.
Bootstrapping cultural and value-added industries.
Improving literacy, creativity and quality of life.
Promoting international cooperation and resource sharing.

Page 3: Language Archiving- Document Annotation and Corpus Linguistics

Space, Time and Language Coordinates for Digital Archives

[Figure: digital archives placed on Language, Time and Space coordinates. Labels: Language; Time; Space; Language in Time; Language in Space; Language in Text, in Speech...; Historical GIS; Language Changes; Language variations; Digital Archives.]

Digital Archives and TSL coordinates (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”])

Page 4: Language Archiving- Document Annotation and Corpus Linguistics

Language Archiving is a Collection of Linguistic Resources

Collection of a linguistic archive (such as a balanced corpus) is guided by a set of design criteria.
Design criteria define natural classes of texts in a collection.
Each criterion establishes a dimension for comparative studies.
www.sinica.edu.tw/SinicaCorpus

Page 5: Language Archiving- Document Annotation and Corpus Linguistics

How to make a single archive more versatile

One Corpus or Many Corpora?
Or How to make a Balanced Corpus Biased?
With Textual Markup Information (e.g. Metadata)
  genre, style, mode, topic, medium etc.
  word, part-of-speech, structure tags, semantic tags
Alignment for heterogeneous corpora

Page 6: Language Archiving- Document Annotation and Corpus Linguistics

Creating Synergy from Uniform Resource Type

Each document is marked up with textual description features: topic, style etc.
Each feature selects a subset of documents.
Sub-corpora (or new archives) can be created online according to the user’s specification.
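The selection of sub-corpora by feature can be sketched as follows. This is a minimal illustration, not the Sinica Corpus implementation: the `Document` class, the `select_subcorpus` function and the three toy documents are all hypothetical, and only the feature names (genre, mode, topic) come from the slides.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    features: dict   # textual description features, e.g. {"topic": "literature"}
    text: str = ""

def select_subcorpus(docs, **criteria):
    """Return the documents whose features match every given criterion."""
    return [d for d in docs
            if all(d.features.get(name) == value for name, value in criteria.items())]

# Toy collection (hypothetical documents, feature values taken from the slides).
corpus = [
    Document("d1", {"genre": "reportage", "mode": "written", "topic": "social sciences"}),
    Document("d2", {"genre": "poetry", "mode": "written", "topic": "literature"}),
    Document("d3", {"genre": "conversation", "mode": "spoken", "topic": "general/leisure"}),
]

spoken = select_subcorpus(corpus, mode="spoken")
written_literature = select_subcorpus(corpus, mode="written", topic="literature")
print([d.doc_id for d in spoken], [d.doc_id for d in written_literature])
```

Each keyword argument narrows the selection, so combining features yields progressively more specific sub-corpora from the same archive.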

Page 7: Language Archiving- Document Annotation and Corpus Linguistics

Creating Synergy from Uniform Resource Type

Classical Chinese Corpora
  http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html
Corpus of Formosan Austronesian Languages
  Under construction, part of the National Digital Archive Initiative
Lexical Databases of other Sino-Tibetan and Tibeto-Burmese Languages

Page 8: Language Archiving- Document Annotation and Corpus Linguistics

Creating Synergy from Heterogeneous Resource Type

Bi-lingual or multi-lingual corpora
Text and speech aligned corpora
Synchronized corpora collected from different areas

Page 9: Language Archiving- Document Annotation and Corpus Linguistics

How to create a balanced corpus?

Creation of the Sinica Corpus: a word-segmented modern Chinese corpus with POS tagging

Page 10: Language Archiving- Document Annotation and Corpus Linguistics

Introduction

TEI: A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage.
It must be representative and balanced if it claims to faithfully represent the facts in that language or sublanguage [Sinclair 87].

Page 11: Language Archiving- Document Annotation and Corpus Linguistics

Introduction

Sinica balanced corpus
  Texts are classified according to 5 different features: (1) Genre, (2) Style, (3) Mode, (4) Topic, (5) Medium
Word segmentation standard
  Segmentation standard for Chinese language processing
  Http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm
Part-of-speech tagging
  46 syntactic categories

Page 12: Language Archiving- Document Annotation and Corpus Linguistics

Genre
  written: reportage, commentary, advertisement, letter, announcement, fiction, prose, biography & diary, poetry, analects, manual
  spoken: script, conversation, speech, meeting minutes
Style: narration, argumentation, exposition, description
Mode: written, written-to-be-read, written-to-be-spoken, spoken, spoken-to-be-written
Topic: philosophy, natural sciences, social sciences, fine arts, general/leisure, literature
Medium: newspaper, general magazine, academic journal, textbook, reference book, thesis, general book, audio/visual media, interactive speech

Page 13: Language Archiving- Document Annotation and Corpus Linguistics

Sinica Corpus

philosophy        10%
natural sciences  10%
social            35%
arts               5%
general/leisure   20%
literature        20%

Page 14: Language Archiving- Document Annotation and Corpus Linguistics

%% 文類 Genre= 報導 reportage

%% 文體 Style= 記敘 Description

%% 語式 Mode= written

%% 主題 Topic= 訊息 Message

%% 媒體 Medium= 報紙 Newspaper

%% 姓名 Author’s name=

%% 性別 Gender= 男女

%% 國籍 Nationality= 中華民國 Chinese

%% 母語 Mother tongue= 中文 Chinese

%% 出版單位 Publisher= 中研院週報 Academia Sinica

%% 出版地 Place= 台北市台灣 Taipei Taiwan

%% 出版日期 date=1994

%% 版次 version=

%% 標題 Title= 國史研習會:中國宗教與社會

1.  。 (PERIODCATEGORY)  由 (P)  本 (Nes)  院 (Nc)  歷史 (Na)  語言 (Na)  研究所 (Nc)  主辦 (VC)  , (COMMACATEGORY)

***********************************************

2.  , (COMMACATEGORY)  台灣 (Nc)  大學 (Nc)  歷史系 (Nc)  暨 (Caa)  研究所 (Nc)  與 (Caa)  清華 (Nb)  大學 (Nc)  歷史系 (Nc)  暨 (Caa)  研究所 (Nc)  協辦 (VC)  之 (DE)  「 (PARENTHESISCATEGORY)  國史(Na)  研習會 (Na)  : (COLONCATEGORY)

***********************************************

3.  : (COLONCATEGORY)  中國 (Nc)  宗教 (Na)  與 (Caa)  社會 (Na)  」 (PARENTHESISCATEGORY)  ,(COMMACATEGORY)

***********************************************
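The sample above combines `%%` header lines carrying document features with numbered lines of `word (TAG)` pairs. A minimal sketch of reading this layout is given below. The regular expressions and function names are assumptions made for illustration, derived only from the sample shown here, and are not the corpus's official file specification.

```python
import re

# "%% <label>= <value>" header lines; the label is kept exactly as written.
HEADER_RE = re.compile(r"^%%\s*(.+?)\s*=\s*(.*)$")
# "word (TAG)" pairs; the space before the parenthesis is optional in the sample.
TOKEN_RE = re.compile(r"(\S+?)\s*\(([A-Za-z]+)\)")
# Leading sentence number such as "1.  "
LINE_NO_RE = re.compile(r"^\s*\d+\.\s+")

def parse_header(line):
    """Return (label, value) for a '%%' header line, or None for other lines."""
    m = HEADER_RE.match(line.strip())
    return (m.group(1), m.group(2).strip()) if m else None

def parse_tagged_line(line):
    """Return a list of (word, tag) pairs found in one tagged line."""
    return TOKEN_RE.findall(LINE_NO_RE.sub("", line))

header = parse_header("%% 媒體 Medium= 報紙 Newspaper")
tokens = parse_tagged_line("1.  。 (PERIODCATEGORY)  由 (P)  本 (Nes)  院 (Nc)")
print(header)
print(tokens)
```

Separator lines made of asterisks contain no `word (TAG)` pair, so `parse_tagged_line` returns an empty list for them.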

Page 15: Language Archiving- Document Annotation and Corpus Linguistics

Design of Corpus Construction and Management System

Page 16: Language Archiving- Document Annotation and Corpus Linguistics

Introduction

Motivations for designing a corpus management system
  It is hard to collect, maintain, classify, and tag a large amount of text without a management system.
  Automate the word segmentation and tagging processes.
  Maintain the precision and consistency of data collection.
  Handle out-of-vocabulary words.

Page 17: Language Archiving- Document Annotation and Corpus Linguistics

Database for Texts

[Figure: Construction System, Tagged text, Text database. Each record in the text database has the fields Text Id, features, text.]

Page 18: Language Archiving- Document Annotation and Corpus Linguistics

Construction Flow

[Figure: flow chart of corpus construction. Components: Internet (WWW), Text Collection Module, Text Files, Unknown Word Identification Module, Word Segmentation and POS-tagging Module, Inspection System, New Word Editor, Tagged Text Editor, Text Database (SQL), Domain Lexicons. Data passed between components: text, text and new words, tagged text, revised tagged text, revised new words.]

Page 19: Language Archiving- Document Annotation and Corpus Linguistics

Text Collection Module

Purpose: Semi-automatically collect various texts from the WWW.
Features: Automatic feature extraction and document classification.

Page 20: Language Archiving- Document Annotation and Corpus Linguistics
Page 21: Language Archiving- Document Annotation and Corpus Linguistics

Unknown Word Identification Module

Identify new words before word segmentation.
Methods:
  Detect the existence of unknown words.
  Apply statistical rules and morphological rules to identify unknown words.

Page 22: Language Archiving- Document Annotation and Corpus Linguistics
Page 23: Language Archiving- Document Annotation and Corpus Linguistics

Word Segmentation & Tagging Module

Based on the word segmentation standard for information processing, the segmentation program segments input text and tags the resulting words with their parts of speech.
Methods: word matching based on the lexicon and newly identified words.
  Segmentation process: longest matching and heuristic rules to resolve segmentation ambiguities.
  POS tagging: bi-gram model for resolving POS ambiguities.
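The longest-matching step can be illustrated with a simple forward maximum-matching sketch. This is an assumption about the general technique only: the heuristic rules for resolving ambiguities and the bi-gram tagger mentioned on the slide are not modelled, the function name is hypothetical, and the toy lexicon is built from the example sentence used elsewhere in this talk.

```python
def segment_longest_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry; fall back to a single character when nothing matches."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon (hypothetical); 減重班 stands in for a newly identified word.
general_lexicon = {"台大", "本", "學期", "舉辦", "減重", "班"}
new_words = {"減重班"}

without_new = segment_longest_match("台大本學期舉辦減重班", general_lexicon)
with_new = segment_longest_match("台大本學期舉辦減重班", general_lexicon | new_words)
print(without_new)
print(with_new)
```

Adding the newly identified word to the lexicon changes the final segment from two words (減重, 班) to one (減重班), which is why unknown words are identified before segmentation.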

Page 24: Language Archiving- Document Annotation and Corpus Linguistics

Word Segmentation & Tagging Module (cont)

Additional features: Incorporate a user-defined dictionary or domain dictionary to enhance word segmentation accuracy.
  Domain dictionary: e.g. medical dictionary, dictionary of computing terminology.
  Extracted unknown words: New words, such as personal names, constantly occur in text. The unknown word identification process extracts them so that they can supplement the dictionary.

Page 25: Language Archiving- Document Annotation and Corpus Linguistics

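[Figure: the word segmentation and tagging process takes text as input, draws on the General Lexicon, the Domain Lexicon and the unknown words extracted from the text, and outputs tagged text.]

Text: 台大本學期舉辦減重班 (“National Taiwan University is holding a weight-loss class this semester”)
Tagged text: 台大 (Nc)  本 (Nes)  學期 (Na)  舉辦 (VC)  減重班 (Na)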

Page 26: Language Archiving- Document Annotation and Corpus Linguistics

Inspection System

Purpose: To assure the quality of the corpus collection, automatically processed texts need to be verified by human experts. An inspection system was therefore designed to speed up the verification process.
Major functions:
  Editing functions: Errors in word breaks, POS tags, features, and sentence breaks can be fixed with a mouse click.
  Reminder functions: The system highlights common errors, prefixes, and suffixes in the text.
  Short-term memory: The system recalls the most recent modifications and fixes errors of the same type automatically.

Page 27: Language Archiving- Document Annotation and Corpus Linguistics

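Inspection System (cont)

Provide lexical information and examples.
Friendly user interface.

[Figure: system architecture. Labels: user (使用者), Web Server, SQL Server, corpus to be constructed (欲構建之語料庫), dictionary (詞典), previous version of the corpus (舊版本之語料庫).]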

Page 28: Language Archiving- Document Annotation and Corpus Linguistics
Page 29: Language Archiving- Document Annotation and Corpus Linguistics

J 塑膠 (Na)  皮 (Na)→ 塑膠皮 (Na)

J 公文 (Na)  包 (VC)→ 公文包 (Na)

J 村 (Nc)  上 (Ncd)→ 村上 (Nb)

J 毛利 (Na)  遜 (VH)→ 毛利遜 (Nb)

J 吉姆 (Nb)  毛利遜 (Nb)→ 吉姆毛利遜 (Nb)

D 世界級 (Na)→ 世界 (Nc)  級 (Na)

D 科學方法 (Na)→ 科學 (Na)  方法 (Na)

D 三代 (Nd)→ 三 (Neu)  代 (Na)

D 交互作用 (Na)→ 交互 (VH)  作用 (Na)

D 如一 (VH)→ 如 (P)  一 (Neu)

C 改變 (VC)→ 改變 (Na)

C 傳統 (VH)→ 傳統 (Na)

C 企畫 (VC)→ 企畫 (Na)

C 自然 (D)→ 自然 (VH)

C 起來 (VA)→ 起來 (Di)

F 反射 (VJ)→ 反射 (VJ)[+nom]

F 遮雨 (VA)→ 遮雨 (VA)[+nom]

F 保持 (VJ)[+nom]→ 保持 (VJ)

F 萊特班 (Na)→ 萊特班 (Na)[+prop]

F 感動 (VHC)→ 感動 (VHC)[+nom]

Page 30: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Management System

Advantages:
  The corpus management system speeds up the construction process and reduces human effort.
  It also increases the precision and consistency of word segmentation and POS tagging.
  The database system facilitates searching, managing, retrieving, and reorganizing texts.

Page 31: Language Archiving- Document Annotation and Corpus Linguistics

Using Corpora

Reorganizing sub-corpora
Searching tools

Page 32: Language Archiving- Document Annotation and Corpus Linguistics
Page 33: Language Archiving- Document Annotation and Corpus Linguistics
Page 34: Language Archiving- Document Annotation and Corpus Linguistics

Reorganizing sub-corpora

Sub-corpora can be reorganized according to different features.
  Sport corpus
  Spoken corpus
  Corpus of the most recent three months
  News corpus
  Corpus of poetry

Page 35: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Searching Tools

[Figure: search flow. Labels: Key word vectors; Key Word in Context (KWIC) Search; KWIC file; Filtering and Sorting; Display, or Print, or Store; Statistics; Collocation.]

Page 36: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Searching Tools

KWIC search

Key word vector      What is matched
[代表, N, φ, φ]      every word 代表 daibiao tagged with the POS noun
[φ, VA, φ, 1]        all monosyllabic intransitive verbs (VA)
[φ, φ, +fw, φ]       all foreign words
[..化, V, φ, 3]      all tri-syllabic verbs with suffix 化 hua '-ize'
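A key word vector can be read as a four-slot pattern in which φ means "anything". The sketch below is one possible interpretation, inferred from the four examples above rather than from the system's documentation: the slot order (word, POS, feature, syllable count), prefix matching on POS tags, "." as a one-character wildcard, and character count as syllable count are all assumptions, and the tokens are toy data.

```python
import re
from typing import NamedTuple, Optional

class Token(NamedTuple):
    word: str
    pos: str
    features: frozenset = frozenset()

class KeyVector(NamedTuple):
    word: Optional[str] = None       # None plays the role of φ
    pos: Optional[str] = None
    feature: Optional[str] = None
    syllables: Optional[int] = None

def matches(vec, tok):
    """True when the token satisfies every non-φ slot of the key word vector."""
    if vec.word is not None:
        # "." stands for any single character; everything else is literal.
        pattern = "".join("." if ch == "." else re.escape(ch) for ch in vec.word)
        if not re.fullmatch(pattern, tok.word):
            return False
    if vec.pos is not None and not tok.pos.startswith(vec.pos):
        return False
    if vec.feature is not None and vec.feature not in tok.features:
        return False
    if vec.syllables is not None and len(tok.word) != vec.syllables:
        return False
    return True

def kwic(tokens, vec, window=2):
    """Return (left context, key token, right context) for every match."""
    return [(tokens[max(0, i - window):i], tok, tokens[i + 1:i + 1 + window])
            for i, tok in enumerate(tokens) if matches(vec, tok)]

# Toy token sequence (hypothetical, not taken from the corpus).
tokens = [Token("代表", "Na"), Token("代表", "VC"), Token("走", "VA"),
          Token("現代化", "VH"), Token("OK", "FW", frozenset({"+fw"}))]

hits = kwic(tokens, KeyVector(pos="VA", syllables=1), window=1)
print(hits)
```

Here `KeyVector(word="代表", pos="N")` corresponds to the first row of the table and `KeyVector(word="..化", pos="V", syllables=3)` to the last.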

Page 37: Language Archiving- Document Annotation and Corpus Linguistics
Page 38: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Searching Tools

Filtering
  The filtering methods include random sampling, removing redundant samples, and removing irrelevant samples by restricting the content in the window of key words.
Displaying, printing, and storing
  The resulting KWIC files can be displayed on screen, printed, or stored for future processing.

Page 39: Language Archiving- Document Annotation and Corpus Linguistics
Page 40: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Searching Tools

Statistics:
  Statistical functions provide distributions of words and categories occurring within the context window of key words.
  For instance, the category distribution of the word 把 ba:

Category          Tag   Frequency  %
preposition       P     2704       92.57
measure           Nf    211        7.22
transitive verb   Vc    3          0.10
determiner        Neqb  2          0.07
noun              Na    1          0.03
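The percentages in the table follow directly from the frequencies: each count is divided by the total of 2921 occurrences. The short sketch below reproduces them; the function name is hypothetical, and the tag list is rebuilt from the frequencies above rather than read from the corpus.

```python
from collections import Counter

def category_distribution(tags):
    """Return (tag, frequency, percentage) rows, most frequent tag first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [(tag, n, round(100 * n / total, 2)) for tag, n in counts.most_common()]

# One tag per occurrence of 把, rebuilt from the frequencies in the table.
tags = ["P"] * 2704 + ["Nf"] * 211 + ["Vc"] * 3 + ["Neqb"] * 2 + ["Na"] * 1

distribution = category_distribution(tags)
for tag, freq, pct in distribution:
    print(f"{tag:5} {freq:5d} {pct:6.2f}")
```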

Page 41: Language Archiving- Document Annotation and Corpus Linguistics
Page 42: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Searching Tools

Collocation finding
  The system finds collocations of the key words by computing the mutual information [Church & Hanks 90] of the key words with the words or parts-of-speech in a user-defined window.

  Mutual information: I(x,y) = log( P(x,y) / (P(x)*P(y)) )

  I(x,y) >> 0 : x,y are strongly associated.
  I(x,y) ≈ 0 : x,y are unrelated.
  I(x,y) << 0 : x,y are mutually exclusive.
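A rough sketch of how mutual information could be estimated from counts is shown below. It makes several assumptions not stated on the slide: probabilities are simple relative frequencies over n tokens, the logarithm is base 2, co-occurrence is counted symmetrically within the window, and the token sequence is toy data. The function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(f_xy, f_x, f_y, n):
    """I(x,y) = log2( P(x,y) / (P(x) * P(y)) ) with relative-frequency estimates."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

def collocations(tokens, key, window=2):
    """Rank the words co-occurring with `key` inside the window by mutual information."""
    n = len(tokens)
    freq = Counter(tokens)
    pair = Counter()
    for i, w in enumerate(tokens):
        if w != key:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pair[tokens[j]] += 1
    scores = {y: mutual_information(f_xy, freq[key], freq[y], n)
              for y, f_xy in pair.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy sequence: "b" always appears next to "a", while "c" and "d" never do.
tokens = "a b a b c d c d".split()
ranked = collocations(tokens, "a", window=1)
print(ranked)
```

With these estimates, a pair that co-occurs exactly as often as chance predicts scores 0, matching the "unrelated" case above.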

Page 43: Language Archiving- Document Annotation and Corpus Linguistics

Examples

The top 16 collocations of ‘威脅’ within the window of distance 10:
1. 飽受  2. 恫嚇  3. 綑綁  4. 構成
5. 嚴重  6. 崩坍  7. 恐怖  8. 恐嚇
9. 遭受  10. 刀槍  11. 滾滾  12. 安全
13. 尖刀  14. 健康  15. 成全  16. 備受

Page 44: Language Archiving- Document Annotation and Corpus Linguistics
Page 45: Language Archiving- Document Annotation and Corpus Linguistics
Page 46: Language Archiving- Document Annotation and Corpus Linguistics

Corpus Linguistics

A corpus provides ample examples of word uses and syntactic patterns. It also reflects the real uses of the language and their frequency distribution.
Comparative studies can be made within KWIC or between sub-corpora.
Automatic knowledge extraction techniques can be applied to a corpus to reduce manual effort.

Page 47: Language Archiving- Document Annotation and Corpus Linguistics

Lexicography

A corpus provides ample examples of different word uses and syntactic patterns.
A corpus reflects the real uses of the language and their frequency distribution.
Collocations show idiomatic patterns, and they are the most important uses of a word.
Examples can be extracted from corpora.
Senses and syntactic functions can be ordered according to their frequencies.
CoBuild, Oxford, EDR, and the Collocation Dictionary of Noun and Measure Words are examples of using corpora for editing dictionaries.

Page 48: Language Archiving- Document Annotation and Corpus Linguistics

Language Modeling

Markov Language Model: the probabilities are estimated from corpora.
  P(W1W2…Wm) = P(W1)*P(W2|W1)*P(W3|W1W2)*…*P(Wm|W1W2…Wm-1)
N-gram Model:
  P(W1W2…Wm) ≈ P(W1)*P(W2|W1)*P(W3|W1W2)*…*P(Wm|Wm-n+1,…,Wm-1)
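For n = 2 the N-gram model reduces to a bi-gram model, whose probabilities can be estimated from corpus counts. The sketch below is a minimal illustration under stated assumptions: maximum-likelihood estimates with no smoothing, a start symbol `<s>` so that P(W1) is treated as P(W1|<s>), hypothetical function names, and a two-sentence toy corpus.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bi-grams over segmented sentences."""
    history, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)
        history.update(padded[:-1])              # every word that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return history, bigrams

def sentence_probability(words, history, bigrams):
    """P(W1…Wm) ≈ product of P(Wi | Wi-1), estimated as count(Wi-1 Wi) / count(Wi-1)."""
    prob, prev = 1.0, "<s>"
    for w in words:
        if history[prev] == 0:
            return 0.0                           # unseen history, no smoothing
        prob *= bigrams[(prev, w)] / history[prev]
        prev = w
    return prob

# Toy segmented corpus (hypothetical sentences).
toy_corpus = [["台大", "舉辦", "減重班"], ["台大", "舉辦", "研習會"]]
history, bigrams = train_bigram(toy_corpus)

p_seen = sentence_probability(["台大", "舉辦", "減重班"], history, bigrams)
p_unseen = sentence_probability(["減重班", "舉辦"], history, bigrams)
print(p_seen, p_unseen)
```

Without smoothing, any sentence containing an unseen bi-gram receives probability 0, which is why practical systems add a smoothing step.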

Page 49: Language Archiving- Document Annotation and Corpus Linguistics

Language Modeling Applications of language modeling:

Input methods: speech recognition, character recognition, spell checking, phonetic input, …

Data compression: Huffman coding, Arithmetic Coding,…

Categorization: Text classification, pos tagging, sense disambiguation, word segmentation,…

Page 50: Language Archiving- Document Annotation and Corpus Linguistics

Machine Translation

IBM [Brown et al. 1990] used the bilingual Hansard corpus to build translation models.
Translating a French sentence F into an English sentence E is equivalent to finding the E which maximizes P(E)*P(F|E).
  P(E) is estimated from a bi-gram model.
  P(F|E) is estimated from an aligned bilingual corpus.
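In the noisy-channel formulation used by Brown et al., candidate translations are scored by the product of a language model probability P(E) and a translation model probability P(F|E). The toy sketch below shows only the selection step. All probabilities and candidate sentences are invented for illustration, and the function name and lookup tables are hypothetical stand-ins for models that would be trained on a real corpus.

```python
def best_translation(f, candidates, lm_prob, tm_prob):
    """Return the candidate E that maximizes P(E) * P(F|E)."""
    return max(candidates, key=lambda e: lm_prob(e) * tm_prob(f, e))

# Invented toy probabilities (not estimated from any corpus).
language_model = {"the house": 0.02, "house the": 0.001}
translation_model = {("la maison", "the house"): 0.3,
                     ("la maison", "house the"): 0.3}

choice = best_translation(
    "la maison",
    ["house the", "the house"],
    lm_prob=lambda e: language_model.get(e, 0.0),
    tm_prob=lambda f, e: translation_model.get((f, e), 0.0),
)
print(choice)
```

In this example the translation model cannot distinguish the two word orders, so the language model decides in favour of the fluent one.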

Page 51: Language Archiving- Document Annotation and Corpus Linguistics

Conclusion

A language archive is not only the most important cultural heritage but also the most important resource for language research.
Computer tools make archiving more efficient and manageable.
Everyone can access the archive through the WWW.

Page 52: Language Archiving- Document Annotation and Corpus Linguistics

Websites: Corpora and Archives

Sinica Corpus (Academia Sinica Balanced Corpus of Modern Chinese)

www.sinica.edu.tw/SinicaCorpus

Academia Sinica Classical Chinese Corpora: Early Mandarin

www.sinica.edu.tw/Early_Mandarin

Academia Sinica Formosan Language Archive: Rukai(Mantauran)

www.ling.sinica.edu.tw/formosan

Page 53: Language Archiving- Document Annotation and Corpus Linguistics

Websites: Digital Museums

Chinese Language KnowledgeNets

WenGuo: Adventures in Wen-Land

http://www.sinica.edu.tw/wen

SouWenJieZi

http://www.dmpo.sinica.edu.tw/~words

Page 54: Language Archiving- Document Annotation and Corpus Linguistics

Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)

5 million words, segmented and tagged
Direct WWW Access
  http://www.sinica.edu.tw/ftms-bin/kiwi.sh
License Information
  http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm

Page 55: Language Archiving- Document Annotation and Corpus Linguistics

Sinica Treebank 1.0

38,725 Trees
239,532 Words
Direct WWW Access (1000 sample trees)
  http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm
License Information
  http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm

Page 56: Language Archiving- Document Annotation and Corpus Linguistics

Mandarin-Across-Taiwan (MAT) Speech Database

Speech files are collected through telephone networks. The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).

MAT-160 (160 speakers)
MAT-2000
http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm