25/03/2004 ALR04 - Sanya, China
Developing Asian Language Corpora: standards and practice
Richard Xiao
Tony McEnery
Paul Baker
Andrew Hardie
Lancaster University
An overview of the talk
Corpus development standards
The EMILLE (Enabling Minority Language Engineering) Corpus
The Lancaster Corpus of Mandarin Chinese (LCMC)
XML-aware, Unicode-compliant corpus exploration tools
Software demonstration
Corpus development standards (1)
Why is standardization important?
  – To be compliant with major international standards
  – To facilitate electronic data exchange
  – To foster cooperation and coordination between different centres and projects
  – To meet the requirements of corpus validation
The ALR Committee is working in the right direction
Corpus development standards (2)
Corpus constituents
  – Corpus manifest
      Type (paper document, computer file, audio/video recording, etc.)
      Carrier (computer file name and location, document title, etc.)
      Status (integral part of corpus, descriptive metadata, associated annotation, documentation, etc.)
      Digital components and the storage format (character encoding, binary format, record structure, etc.)
  – Primary data: corpus files
  – Ancillary data: corpus documentation
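As an illustration only, one manifest entry could be modelled as a small record with the four properties listed above. The class name, field names, and the example file path are hypothetical; the talk does not prescribe a concrete manifest format.

```python
from dataclasses import dataclass, asdict


@dataclass
class ManifestEntry:
    """One corpus constituent (hypothetical field names)."""
    component_type: str   # e.g. paper document, computer file, recording
    carrier: str          # e.g. file name and location, or document title
    status: str           # e.g. integral part of corpus, documentation
    storage_format: str   # e.g. character encoding, binary format


# Hypothetical entry for a single primary-data file.
entry = ManifestEntry(
    component_type="computer file",
    carrier="corpus/A01.xml",
    status="integral part of corpus",
    storage_format="XML, UTF-8",
)

# A plain dict is easy to serialise alongside the corpus documentation.
record = asdict(entry)
```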
Corpus development standards (3)
Data formats
  – Primary data
      Text files: XML/SGML conforming to a standard or supplied DTD or schema
      Audio: MP3 or WAV
      Video: MPEG or QuickTime
      Image files: PNG or JPG
  – Ancillary data
      Documentation: PDF, HTML, or XML
Corpus development standards (4)
File structure, markup and annotation
  – Corpus header
      Providing metadata about the corpus file
      TEI/CES-compliance
  – Corpus body
      Containing the corpus data
      TEI/CES-compliance
      Markup for paragraphs and sentences
      Preferably annotated with various levels of linguistic analysis (POS tagging…)
Character encoding
  – Unicode-compliance (UTF-8/16)
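A minimal sketch of how two of these requirements might be checked automatically: that a file decodes as UTF-8 and that it is well-formed XML with paragraph and sentence markup. The function name and the element names `p` and `s` are assumptions for illustration. Validation against TEI/CES itself needs a DTD- or schema-aware validator and is not attempted here.

```python
import xml.etree.ElementTree as ET


def check_corpus_file(raw: bytes) -> dict:
    """Report UTF-8 validity, XML well-formedness, and <p>/<s> counts."""
    report = {"utf8": False, "well_formed": False,
              "paragraphs": 0, "sentences": 0}
    try:
        raw.decode("utf-8")          # fails on any invalid byte sequence
    except UnicodeDecodeError:
        return report
    report["utf8"] = True
    try:
        root = ET.fromstring(raw)    # well-formedness only, no DTD check
    except ET.ParseError:
        return report
    report["well_formed"] = True
    report["paragraphs"] = len(root.findall(".//p"))
    report["sentences"] = len(root.findall(".//s"))
    return report


# Hypothetical miniature corpus file with a header and a body.
SAMPLE = ("<corpus><header><title>demo</title></header>"
          "<body><p><s>一 二</s><s>三</s></p></body></corpus>").encode("utf-8")
result = check_corpus_file(SAMPLE)
```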
The EMILLE project
The EMILLE project
  – Funded by the UK EPSRC (Grant references GR/N19106, GR/M70735, GR/N28542 and GR/R42429/01)
  – Research partners: Lancaster University, Sheffield University, and the Central Institute of Indian Languages (CIIL) in Mysore, India
  – Three main goals
      To build corpora of South Asian languages
      To extend the GATE (General Architecture for Text Engineering) LE architecture
      To develop basic LE tools
  – Project site: http://www.emille.lancs.ac.uk/
  – GATE: http://gate.ac.uk/sale/tao/index.html#x1-550002.26
The EMILLE Corpus: An overview
Three components
  – Monolingual, Annotated, and Parallel
14 South Asian languages
  – Spoken data for five languages
Monolingual corpora contain more than 96 million words
  – Spoken data over 2.6 million words
The Urdu corpus is POS tagged
Part of the Hindi corpus is annotated for anaphora
Parallel corpus covers English and five South Asian languages
Corpus building tools: Uni-codify, Uni-viewer, Uni-editor
The EMILLE Monolingual Corpora
Language    Written      Spoken     Total
Assamese    2,620,000    0          2,620,000
Bengali     5,520,000    442,000    5,962,000
Gujarati    12,150,000   564,000    12,714,000
Hindi       12,390,000   588,000    12,978,000
Kannada     2,240,000    0          2,240,000
Kashmiri    2,270,000    0          2,270,000
Malayalam   2,350,000    0          2,350,000
Marathi     2,210,000    0          2,210,000
Oriya       2,730,000    0          2,730,000
Punjabi     15,600,000   521,000    16,121,000
Sinhala     6,860,000    0          6,860,000
Tamil       19,980,000   0          19,980,000
Telugu      3,970,000    0          3,970,000
Urdu        1,640,000    512,000    2,152,000
Total       93,530,000   2,627,000  96,157,000
The EMILLE Annotated Corpora
POS tagging
  – The whole monolingual Urdu corpus
  – The Urdu component of the EMILLE Parallel Corpora
Anaphoric annotation
  – Around 100,000 words of news material (20 excerpts from the Ranchi Express data) from the Hindi Monolingual Corpus
The EMILLE Parallel Corpus
75 advice leaflets published by the UK government
Approximately 200,000 words of English originals with accompanying translations in five South Asian languages
  – Hindi, Bengali, Punjabi, Gujarati, and Urdu
Covering a range of term-rich domains
The EMILLE corpus building tools
Uni-codify
  – Allows users to convert 30 (or so) different 8-bit encodings of South Asian scripts into 16-bit little-endian Unicode format
  – Compiled program accompanied by documentation
Uni-Viewer
  – Allows users to view Unicode texts
Uni-Editor
  – Allows users to edit Unicode texts
Urdu POS tagger
  – POS tagging Unicode-encoded Urdu texts
  – Accompanied by the tagset and the user manual
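The kind of conversion Uni-codify performs can be sketched as a table lookup followed by a UTF-16 little-endian encode. This is not Uni-codify's code. The mapping table below is a three-entry hypothetical stand-in for a real 8-bit font encoding, and the function name is invented for the example.

```python
from typing import Mapping

# Hypothetical 8-bit encoding: byte value -> Unicode character.
LEGACY_TO_UNICODE = {
    0xA4: "\u0905",  # DEVANAGARI LETTER A
    0xB3: "\u0915",  # DEVANAGARI LETTER KA
    0xDA: "\u093E",  # DEVANAGARI VOWEL SIGN AA
}


def legacy_to_utf16le(data: bytes, table: Mapping[int, str]) -> bytes:
    """Map each legacy byte to Unicode, then encode as UTF-16LE."""
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))        # ASCII range passes through
        elif byte in table:
            chars.append(table[byte])      # mapped script character
        else:
            chars.append("\ufffd")         # unmapped byte: replacement char
    return "".join(chars).encode("utf-16-le")


converted = legacy_to_utf16le(bytes([0xB3, 0xDA, 0x20, 0xA4]),
                              LEGACY_TO_UNICODE)
```

Real legacy encodings often need reordering and many-to-one mappings as well, so a byte-by-byte table is only the simplest case.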
The EMILLE Corpus: Availability
The full release of the EMILLE Corpus and tools is distributed free of charge for use in non-profit-making research
Digital sound files will also be released soon
Indexed version for use with Xara will be available soon
Corpus download site
  – http://www.ling.lancs.ac.uk/corplang/emille
The LCMC Corpus: Aims
Built for the ESRC project Contrasting tense and aspect in English and Chinese (Grant Ref. RES-000-220135)
A Chinese match for FLOB/Frown for BrE/AmE
A publicly available balanced corpus of Mandarin Chinese
Distributed free of charge for use in non-profit-making research
LCMC: Profile
One million words
1990-1993
15 text categories
500 text samples
Major text provider: SSReader Digital Library in China
Unicode (UTF-8)
XML-conformant mark-up
Marked for paragraphs and sentences
POS-tagged (precision rate 98%+)
Standard character and Romanized Pinyin versions
Major Chinese corpus resources
Corpus POS Bal. Channel Variety Contr.
LCMC Yes Yes Written Mainland E – C
Sinica Yes Yes Mixed Taiwan No
PH No No Written Mainland No
PFR Yes No Written Mainland No
LIVAC No No Written Mixed C – C
SCCSD No Yes Spoken Mainland No
TREC No No Written Mainland No
Gigaword No No Written Mainland No
Callhome No ? Spoken Mixed No
LCMC: Sampling frame
Code  Text category              Samples  Proportion
A     Press reportage            44       8.8%
B     Press editorials           27       5.4%
C     Press reviews              17       3.4%
D     Religion                   17       3.4%
E     Skills/trades/hobbies      38       7.6%
F     Popular lore               44       8.8%
G     Biographies/essays         77       15.4%
H     Miscellaneous              30       6%
J     Science                    80       16%
K     General fiction            29       5.8%
L     Mystery/detective fiction  24       4.8%
M     Science fiction            6        1.2%
N     Western/adventure fiction  29       5.8%
P     Romantic fiction           29       5.8%
R     Humor                      9        1.8%
Total                            500      100%
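The Proportion column is each category's sample count divided by the 500 samples in the corpus. A short sketch that recomputes it from the Samples column (variable names are illustrative):

```python
# Sample counts per text category, copied from the sampling frame.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total = sum(SAMPLES.values())

# Percentage share of each category, rounded to one decimal place.
proportions = {code: round(100 * n / total, 1)
               for code, n in SAMPLES.items()}

for code, share in proportions.items():
    print(f"{code}\t{SAMPLES[code]}\t{share}%")
```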
LCMC: Markup
Level  Code  Gloss                   Attribute  Value
1      text  Text type               TYPE       As per Table 2 Text Category
                                     ID         As per Table 2 Code
2      file  Corpus file             ID         Text ID plus file number starting from 01
3      p     Paragraph               ---        ---
4      s     Sentence                n          Starting from 0001 onwards
5      w     Word                    POS        Part-of-speech tags as per the LCMC tagset
       c     Punctuation and symbol
       gap   Omission                ---        ---
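To show how this element hierarchy might be read programmatically, here is a sketch using Python's standard XML parser. The sample fragment is invented for the example: its sentences, POS values, and the omission of attributes on `c` are assumptions, and only the element and attribute names come from the table above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the text > file > p > s > w/c/gap scheme.
SAMPLE = """<text TYPE="Press reportage" ID="A">
  <file ID="A01">
    <p>
      <s n="0001"><w POS="n">语料库</w><w POS="v">发布</w><c>。</c></s>
      <s n="0002"><w POS="r">我们</w><w POS="v">研究</w><w POS="n">汉语</w><gap/><c>。</c></s>
    </p>
  </file>
</text>"""


def tagged_sentences(xml_text: str) -> dict:
    """Map 'fileID:sentence-number' to a list of (word, POS) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for file_el in root.iter("file"):
        for s in file_el.iter("s"):
            key = f'{file_el.get("ID")}:{s.get("n")}'
            # Only <w> elements are collected; <c> and <gap> are skipped.
            result[key] = [(w.text, w.get("POS")) for w in s.iter("w")]
    return result


sentences = tagged_sentences(SAMPLE)
```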
LCMC: Annotation
Segmentation
POS tagging
  – Applying the Peking University tagset
      26 Level 1 POS tags
      50 Level 2 POS tags
  – ICTCLAS (Chinese Lexical Analysis System)
      Developed by the Institute of Computing Technology, Chinese Academy of Sciences (Zhang & Liu 2002)
      A frequency dictionary of 80,000 words
      Based on a multi-layer hidden Markov model
      Applying the n-shortest paths method
  – Automatic tagging with a precision rate of 97.16%
  – Post-editing improved the precision to over 98%
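The idea behind path-based segmentation with a frequency dictionary can be sketched as follows. This is a deliberately simplified illustration, not ICTCLAS: it keeps only the single cheapest path rather than the n shortest, uses unigram costs with no hidden Markov model, and relies on a tiny invented dictionary with made-up counts.

```python
import math

# Hypothetical frequency dictionary (counts are invented).
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 200, "起源": 30}


def segment(sentence, freq=FREQ, max_len=3):
    """Return the segmentation with the lowest total -log(probability)."""
    total = sum(freq.values())
    n = len(sentence)
    best = [math.inf] * (n + 1)   # best[i]: cheapest cost to reach position i
    back = [0] * (n + 1)          # back[i]: start of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            count = freq.get(word)
            if count is None:
                if end - start > 1:
                    continue      # unknown multi-character strings are not words
                count = 0.5       # unknown single characters get a small count
            cost = best[start] - math.log(count / total)
            if cost < best[end]:
                best[end], back[end] = cost, start
    words, pos = [], n
    while pos > 0:                # follow back-pointers from the end
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("研究生命的起源"))
```

With these counts the path through 研究 and 生命 is cheaper than the one through 研究生 and 命, which is the kind of ambiguity a frequency dictionary helps resolve.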
LCMC: Potential use
Monolingual study
  – Studying Mandarin Chinese as a whole
  – Exploring variation across text categories
Contrastive study (in conjunction with FLOB/Frown)
  – Contrasting Chinese and BrE/AmE
  – Contrasting text categories in Chinese and English
LCMC: Availability
Distributed free of charge for use in non-profit-making research
Accompanied by the user manual
Online search available via WebConc
The LCMC website
  – http://www.ling.lancs.ac.uk/corplang/lcmc
The Chinese mirror site (Chinese Academy of Social Sciences)
  – http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm
Corpus exploration tools
XML-aware, Unicode-compliant corpus exploration tools
  – The WordSmith Tools version 4
      Presently under beta test
      Beta version available
        – http://www.lexically.net/wordsmith/version4/index.htm
  – Xara (XML-aware Sara)
      Sara: SGML-aware Retrieval Application
        – For use with the British National Corpus (BNC)
      For either local or remote access
      Presently under beta test
      Documentation available at http://www.oucs.ox.ac.uk/rts/xara/
      A tutorial available at the LCMC website
Software demonstration
Using Xara for local access to LCMC
  – Query types: Quick query, word query (pattern), POS query, pattern query (regex), Query builder (e.g. a-n vs. a-de-n), etc.
  – Display mode: KWIC mode vs. sentence mode
  – Display format: Plain vs. XML
  – Status bar: Reference
  – Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
Using Xara for local access to EMILLE
Using WebConc to access LCMC
  – http://www.ling.lancs.ac.uk/corplang/lcmc
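For readers unfamiliar with the KWIC (key word in context) display and regex pattern queries mentioned above, the following sketch shows the general idea on a list of tokens. It does not use or imitate Xara's interface. The function name, the `width` parameter, and the English sample sentence are all invented for the example.

```python
import re


def kwic(tokens, pattern, width=3):
    """Return (left context, match, right context) for each matching token."""
    regex = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if regex.fullmatch(tok):            # the whole token must match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "the corpus is balanced and the corpora are tagged".split()
hits = kwic(tokens, r"corp(us|ora)", width=2)
for left, key, right in hits:
    print(f"{left:>10}  [{key}]  {right}")
```

Sentence mode would instead return the whole sentence containing each match. A POS query would match on the tag rather than the word form.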