Developing Asian Language Corpora: standards and practice

25/03/2004 ALR04 - Sanya, China

Richard Xiao

Tony McEnery

Paul Baker

Andrew Hardie

Lancaster University

An overview of the talk

Corpus development standards
The EMILLE (Enabling Minority Language Engineering) Corpus
The Lancaster Corpus of Mandarin Chinese (LCMC)
XML-aware, Unicode-compliant corpus exploration tools
Software demonstration


Corpus development standards (1)

Why is standardization important?
  – To be compliant with major international standards
  – To facilitate electronic data exchange
  – To foster cooperation and coordination between different centres and projects
  – To meet the requirements of corpus validation
The ALR Committee is working in the right direction



Corpus constituents
  – Corpus manifest
      Type (paper document, computer file, audio/video recording, etc.)
      Carrier (computer file name and location, document title, etc.)
      Status (integral part of corpus, descriptive metadata, associated annotation, documentation, etc.)
      Digital components and the storage format (character encoding, binary format, record structure, etc.)
  – Primary data: corpus files
  – Ancillary data: corpus documentation
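The manifest fields listed above can be pictured as one record per corpus constituent. The following is a minimal sketch only: the class name `ManifestEntry`, its field names, and the file paths are hypothetical and are not prescribed by any standard mentioned in this talk.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ManifestEntry:
    """One constituent of a corpus (hypothetical field names)."""
    kind: str            # Type: paper document, computer file, recording, ...
    carrier: str         # Carrier: file name and location, document title, ...
    status: str          # Status: integral part of corpus, documentation, ...
    storage_format: str  # Character encoding, binary format, record structure


# Illustrative entries; the paths are invented for this example.
entries = [
    ManifestEntry("computer file", "corpus/A01.xml",
                  "integral part of corpus", "XML, UTF-8"),
    ManifestEntry("computer file", "doc/manual.pdf",
                  "documentation", "PDF"),
]


def primary_data(items):
    """Return the entries that are an integral part of the corpus."""
    return [e for e in items if e.status == "integral part of corpus"]


# Serialize the manifest so it can be exchanged alongside the corpus files.
manifest_json = json.dumps([asdict(e) for e in entries], indent=2)
print(manifest_json)
```

Keeping the manifest machine-readable makes it easy to separate primary data from ancillary data when a corpus is validated or exchanged.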



Data formats
  – Primary data
      Text files: XML/SGML conforming to a standard or supplied DTD or schema
      Audio: MP3 or WAV
      Video: MPEG or QuickTime
      Image files: PNG or JPG
  – Ancillary data
      Documentation: PDF, HTML, or XML



File structure, markup and annotation
  – Corpus header
      Providing metadata about the corpus file
      TEI/CES-compliance
  – Corpus body
      Containing the corpus data
      TEI/CES-compliance
      Markup for paragraphs and sentences
      Preferably annotated with various levels of linguistic analysis (POS tagging…)
Character encoding
  – Unicode-compliance (UTF-8/16)
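The header/body split and the Unicode requirement can be illustrated with a small file read back through a standard XML parser. This is a simplified sketch: the element names are loosely modelled on CES, the sample is not validated against any DTD, and the title and sentences are invented.

```python
import xml.etree.ElementTree as ET

# A simplified corpus file: a header carrying metadata, and a body marked up
# for paragraphs (<p>) and sentences (<s>). Not a complete CES document.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<cesDoc>
  <cesHeader>
    <title>Sample corpus file</title>
  </cesHeader>
  <text>
    <body>
      <p><s>这是第一句。</s><s>这是第二句。</s></p>
    </body>
  </text>
</cesDoc>
"""


def describe(xml_bytes):
    """Return (title, paragraph count, sentence count) for a corpus file."""
    root = ET.fromstring(xml_bytes)
    title = root.findtext("cesHeader/title")
    paragraphs = root.findall("text/body/p")
    sentences = root.findall("text/body/p/s")
    return title, len(paragraphs), len(sentences)


# Store the file as UTF-8 bytes, matching the declared encoding.
utf8_bytes = SAMPLE.encode("utf-8")
print(describe(utf8_bytes))
```

Because the markup is well-formed XML in a Unicode encoding, any XML-aware tool can recover the metadata and the sentence boundaries without format-specific code.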


The EMILLE project

  – Funded by the UK EPSRC (Grant references GR/N19106, GR/M70735, GR/N28542 and GR/R42429/01)
  – Research partners: Lancaster University, Sheffield University, and the Central Institute of Indian Languages (CIIL) in Mysore, India
  – Three main goals
      To build corpora of South Asian languages
      To extend the GATE (General Architecture for Text Engineering) language engineering (LE) architecture
      To develop basic LE tools
  – Project site: http://www.emille.lancs.ac.uk/
  – GATE: http://gate.ac.uk/sale/tao/index.html#x1-550002.26


The EMILLE Corpus: An overview

Three components
  – Monolingual, Annotated, and Parallel
14 South Asian languages
  – Spoken data for five languages
Monolingual corpora contain more than 96 million words
  – Spoken data: over 2.6 million words
The Urdu corpus is POS tagged
Part of the Hindi corpus is annotated for anaphora
Parallel corpus covers English and five South Asian languages
Corpus building tools: Uni-codify, Uni-Viewer, Uni-Editor


The EMILLE Monolingual Corpora

Language     Written      Spoken        Total
Assamese   2,620,000           0    2,620,000
Bengali    5,520,000     442,000    5,962,000
Gujarati  12,150,000     564,000   12,714,000
Hindi     12,390,000     588,000   12,978,000
Kannada    2,240,000           0    2,240,000
Kashmiri   2,270,000           0    2,270,000
Malayalam  2,350,000           0    2,350,000
Marathi    2,210,000           0    2,210,000
Oriya      2,730,000           0    2,730,000
Punjabi   15,600,000     521,000   16,121,000
Sinhala    6,860,000           0    6,860,000
Tamil     19,980,000           0   19,980,000
Telugu     3,970,000           0    3,970,000
Urdu       1,640,000     512,000    2,152,000
Total     93,530,000   2,627,000   96,157,000


The EMILLE Annotated Corpora

POS tagging
  – The whole monolingual Urdu corpus
  – The Urdu component of the EMILLE Parallel Corpora
Anaphoric annotation
  – Around 100,000 words of news material (20 excerpts from the Ranchi Express data) from the Hindi Monolingual Corpus


The EMILLE Parallel Corpus

75 advice leaflets published by the UK government

Approximately 200,000 words of English originals with accompanying translations in five South Asian languages
  – Hindi, Bengali, Punjabi, Gujarati, and Urdu

Covering a range of term-rich domains


The EMILLE corpus building tools

Uni-codify
  – Allows users to convert about 30 different 8-bit encodings of South Asian scripts into 16-bit little-endian Unicode format
  – Compiled program accompanied by documentation
Uni-Viewer
  – Allows users to view Unicode texts
Uni-Editor
  – Allows users to edit Unicode texts
Urdu POS tagger
  – POS tagging of Unicode-encoded Urdu texts
  – Accompanied by the tagset and the user manual


The EMILLE Corpus: Availability

The full release of the EMILLE Corpus and tools is distributed free of charge for use in non-profit-making research
Digital sound files will also be released soon
An indexed version for use with Xara will be available soon
Corpus download site
  – http://www.ling.lancs.ac.uk/corplang/emille


The LCMC Corpus: Aims

Built for the ESRC project "Contrasting tense and aspect in English and Chinese" (Grant Ref. RES-000-220135)
A Chinese match for FLOB/Frown for BrE/AmE
A publicly available balanced corpus of Mandarin Chinese
Distributed free of charge for use in non-profit-making research


LCMC: Profile

One million words
1990-1993
15 text categories
500 text samples
Major text provider: SSReader Digital Library in China
Unicode (UTF-8)
XML-conformant mark-up
Marked for paragraphs and sentences
POS-tagged (precision rate 98%+)
Standard character and Romanized Pinyin versions


Major Chinese corpus resources

Corpus     POS-tagged   Balanced   Channel   Variety    Contrastive
LCMC       Yes          Yes        Written   Mainland   E – C
Sinica     Yes          Yes        Mixed     Taiwan     No
PH         No           No         Written   Mainland   No
PFR        Yes          No         Written   Mainland   No
LIVAC      No           No         Written   Mixed      C – C
SCCSD      No           Yes        Spoken    Mainland   No
TREC       No           No         Written   Mainland   No
Gigaword   No           No         Written   Mainland   No
Callhome   No           ?          Spoken    Mixed      No


LCMC: Sampling frame

Code   Text category               Samples   Proportion
A      Press reportage                  44         8.8%
B      Press editorials                 27         5.4%
C      Press reviews                    17         3.4%
D      Religion                         17         3.4%
E      Skills/trades/hobbies            38         7.6%
F      Popular lore                     44         8.8%
G      Biographies/essays               77        15.4%
H      Miscellaneous                    30           6%
J      Science                          80          16%
K      General fiction                  29         5.8%
L      Mystery/detective fiction        24         4.8%
M      Science fiction                   6         1.2%
N      Western/adventure fiction        29         5.8%
P      Romantic fiction                 29         5.8%
R      Humor                             9         1.8%
Total                                  500         100%
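The proportions in the sampling frame follow directly from the sample counts. The short check below recomputes them; the variable names are illustrative, and the average sample size is only an approximation derived from the stated one-million-word total.

```python
# Number of text samples per category code, as given in the sampling frame.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total = sum(SAMPLES.values())

# Proportion of the corpus contributed by each category, in percent.
proportions = {code: round(100 * n / total, 1) for code, n in SAMPLES.items()}

# With about one million words overall, the average sample length in words.
average_sample_words = 1_000_000 // total

for code, share in proportions.items():
    print(f"{code}: {SAMPLES[code]:3d} samples, {share}%")
print(f"Total: {total} samples, about {average_sample_words} words each")
```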


LCMC: Markup

Level   Code   Gloss                    Attribute   Value
1       text   Text type                TYPE        As per Table 2 Text Category
                                        ID          As per Table 2 Code
2       file   Corpus file              ID          Text ID plus file number starting from 01
3       p      Paragraph                ---         ---
4       s      Sentence                 n           Starting from 0001 onwards
5       w      Word                     POS         Part-of-speech tags as per the LCMC tagset
        c      Punctuation and symbol
        gap    Omission                 ---         ---
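The five markup levels above can be seen in a small fragment and read with a standard XML parser. This is a sketch under assumptions: the two sentences, the POS values, and the attribute values are invented for illustration and are not taken from LCMC, and `<c>` and `<gap>` are shown without attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment following the element hierarchy in the table:
# text > file > p > s > w / c / gap.
SAMPLE = """<text TYPE="Press reportage" ID="A">
  <file ID="A01">
    <p>
      <s n="0001"><w POS="r">我</w><w POS="v">看</w><w POS="n">书</w><c>。</c></s>
      <s n="0002"><w POS="r">他</w><w POS="v">来</w><gap/><c>。</c></s>
    </p>
  </file>
</text>"""

root = ET.fromstring(SAMPLE)


def pos_counts(tree):
    """Count POS tags over all word (<w>) elements."""
    return Counter(w.get("POS") for w in tree.iter("w"))


def sentences(tree):
    """Return (sentence number, text) pairs, skipping <gap> elements."""
    result = []
    for s in tree.iter("s"):
        text = "".join(tok.text for tok in s if tok.tag in ("w", "c"))
        result.append((s.get("n"), text))
    return result


print(pos_counts(root))
print(sentences(root))
```

Because segmentation is encoded as one `<w>` element per word, word-level queries need no separate tokenizer.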


LCMC: Annotation

Segmentation
POS tagging
  – Applying the Peking University tagset
      26 Level 1 POS tags
      50 Level 2 POS tags
  – ICTCLAS (Chinese Lexical Analysis System)
      Developed by the Institute of Computing Technology, Chinese Academy of Sciences (Zhang & Liu 2002)
      A frequency dictionary of 80,000 words
      Based on a multi-layer hidden Markov model
      Applying the n-shortest paths method
  – Automatic tagging with a precision rate of 97.16%
  – Post-editing improved the precision to over 98%
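To give a feel for path-based segmentation, the sketch below finds a single shortest path through a word lattice, that is, the segmentation with the fewest words. It is not ICTCLAS's implementation: it keeps only one path rather than n, it uses no hidden Markov model, and the function name, toy dictionary, and `max_len` parameter are hypothetical.

```python
def shortest_path_segment(text, dictionary, max_len=4):
    """Segment text into the fewest dictionary words (single shortest path).

    Every single character is allowed as a fallback word, so a
    segmentation always exists.
    """
    n = len(text)
    # best[i] holds (word count, words) for the best segmentation of text[:i].
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        count, words = best[i]
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if j == i + 1 or piece in dictionary:
                candidate = (count + 1, words + [piece])
                # Keep the candidate only if it uses strictly fewer words.
                if best[j] is None or candidate[0] < best[j][0]:
                    best[j] = candidate
    return best[n][1]


toy_dictionary = {"我们", "学习", "汉语"}
print(shortest_path_segment("我们学习汉语", toy_dictionary))
```

Different segmentations can tie on word count, which is one reason a system might keep several candidate paths and choose among them with a statistical model.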


LCMC: Potential use

Monolingual study
  – Studying Mandarin Chinese as a whole
  – Exploring variation across text categories
Contrastive study (in conjunction with FLOB/Frown)
  – Contrasting Chinese and BrE/AmE
  – Contrasting text categories in Chinese and English


LCMC: Availability

Distributed free of charge for use in non-profit-making research
Accompanied by the user manual
Online search available via WebConc
The LCMC website
  – http://www.ling.lancs.ac.uk/corplang/lcmc
The Chinese mirror site (Chinese Academy of Social Sciences)
  – http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm


Corpus exploration tools

XML-aware, Unicode-compliant corpus exploration tools
  – WordSmith Tools version 4
      Presently under beta test
      Beta version available at http://www.lexically.net/wordsmith/version4/index.htm
  – Xara (XML-aware Sara)
      Sara: SGML-aware Retrieval Application, for use with the British National Corpus (BNC)
      For either local or remote access
      Presently under beta test
      Documentation available at http://www.oucs.ox.ac.uk/rts/xara/
      A tutorial available at the LCMC website


Software demonstration

Using Xara for local access to LCMC
  – Query types: quick query, word query (pattern), POS query, pattern query (regex), query builder (e.g. a-n vs. a-de-n), etc.
  – Display mode: KWIC mode vs. sentence mode
  – Display format: plain vs. XML
  – Status bar: reference
  – Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
Using Xara for local access to EMILLE
Using WebConc to access LCMC
  – http://www.ling.lancs.ac.uk/corplang/lcmc
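The KWIC (key word in context) display mode mentioned above lines up every hit of a query word with its left and right context. The sketch below shows the idea on a list of tokens; it is not how Xara or WebConc work internally, and the function name `kwic` and its `width` parameter are hypothetical.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "the cat sat on the mat".split()
hits = kwic(tokens, "the", width=2)

# Right-align the left context so the keyword forms a single column.
for left, key, right in hits:
    print(f"{left:>10}  [{key}]  {right}")
```

For a word-segmented corpus such as LCMC, the tokens would come from the word-level markup rather than from splitting on spaces.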


And…

Thank you!

Richard Xiao
