LANGUAGE RESOURCES IN MALAYSIA
Zaharin Yusoff
Computer-Aided Translation Unit
School of Computer Sciences
Universiti Sains Malaysia
11800 Penang, Malaysia
Symposium on Language Resources in Asia
HISTORICAL PERSPECTIVE
[Timeline figure, 1977 / 1980 / 1990 / 2000. Labels as extracted, in order: USM, GETA; UTMK; MT, MAHT, CL TOOLS; NLP APPLICATIONS; UTM, CICC; MT; ITNM; TRANSLATION; UKM; NLP; UM; UiTM; MT; CALL]
UNIVERSITI SAINS MALAYSIA (USM)
Unit Terjemahan Melalui Komputer (UTMK)
UNIV. TEKNOLOGI MALAYSIA (UTM)
INSTITUT TERJEMAHAN NEGARA (ITNM)
UNIV. KEBANGSAAN MALAYSIA (UKM)
UNIVERSITI MALAYA (UM)
UNIV. TEKNOLOGI MARA (UiTM)
DEWAN BAHASA DAN PUSTAKA (DBP)
LINGWARE DATA
APPLICATION BASED
GENERIC TOOLS
LINGUISTIC DATA
COMP. LING. TOOLS
MAIN POINTS
•NOT TOO MANY
•MOSTLY NOT UPDATED
•SOME ARE REUSABLE
LANGUAGE RESOURCES
•THE MORE RECENT ONES
•DEPENDENT ON DEMAND
•BUT MODULAR & REPROGRAMMABLE
LANGUAGE DATA
•VERY LITTLE
•NOT REUSABLE
•METHODOLOGIES OK
•REASONABLE
•SOME INCOMPLETE
•DIFFICULT TO ACQUIRE
•BUT REUSABLE
RECALL:
•Too Few Researchers (60 at peak in 1991, now 15)
•Lacking in Formal Linguistic Studies for Malay
•Lack of Culture of Data Accumulation
LINGUISTIC RESOURCES
GENERIC TOOLS
•MT software: JEMAH
•Automatic Generator of Lingware
  •Analysis
  •Synthesis
•User-Driven MT System
•Language Tools:
  -Spellchecker
  -Desktop Accessories (Dicts)
  -Text Analysis
  -etc.
•Linguistic Tools:
  -Corpus System
  -Dictionary System
  -Grammar Editor (STCG)
  -Bilingual Corpus Bank
  -etc.
APPLICATION BASED TOOLS
•MAHT system: SISKEP
•Example Based MT
•EDI (parsing/generation msg. types)
•Semantic Driven Search Engine
•WEB Crawler
•Internet Portal (??)
•NOT TOO MANY
•MOSTLY NOT UPDATED
•SOME ARE REUSABLE
•THE MORE RECENT ONES
•DEPENDENT ON DEMAND
•BUT MODULAR & REPROGRAMMABLE
LINGWARE DATA
•Ariane/Jemah MT English->Malay (all phases)
•STCG Malay Grammar
•VERY LITTLE
•NOT REUSABLE
•METHODOLOGIES OK
LANGUAGE DATA
DICTIONARIES (WINHELP)
•ENGLISH-MALAY DBP (KIMD) 10.16 MB 1945 pages
•MALAY DBP (KD) 6.63 MB 1566 pages
•TERMINOLOGIES (MABBIM) 8.13 MB 1069 pages
•COMPUTER (Malay) 1.15 MB …
•FRENCH-ENGLISH-MALAY 3.57 MB …
DICTIONARIES (Databases: attribute format)
•KIMD (as above): data missing for letters B, O, R, S, T, U, V, X, Y, Z
•KD (as above): letter A only (1,544 words)
•MALAY THESAURUS
CORPUS
•Malay Books, Letters to Editor (System): 2.2 million words
•Translations (Malay only in MS Word) 23 titles (average 1.5 MB, 350 pages)
•English-Malay (Parallel Text) 3 titles (1 with sentence alignment)
•REASONABLE
•SOME INCOMPLETE
•DIFFICULT TO ACQUIRE
•BUT REUSABLE
LANGUAGE DATA (cont.)
KIMD-WordNet Link (A->F only)
Sources are KIMD and WordNet; entries are linked sense by sense between WordNet and KIMD, e.g.
abacus
KIMD(abacus,n,1 [device, for, calculating, ',', a, square, or, rectangular, frame, ….]).
***(entry and definition taken from KIMD; some redefined to fit)
WORDNET(102155519, 1, 'abacus', n, 2, 0, [performs, arithmetic, functions, by, ….]).
***(entry and definition taken from WordNet)
===sepua, sempoa, dekak-dekak
***(Malay equivalent taken from KIMD)
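The linked record above can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class names, field names and the `index_by_synset` helper are hypothetical and do not come from the UTMK system. The WordNet field names are borrowed from the WordNet Prolog `s` predicate, whose argument order the WORDNET(...) fact appears to follow. Both glosses are truncated on the slide, so only the visible tokens are used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class KimdSense:
    """One sense of a KIMD (English-Malay) entry. Field names are assumed."""
    headword: str
    pos: str
    sense_no: int
    gloss: Tuple[str, ...]  # tokenised definition; truncated on the slide


@dataclass(frozen=True)
class WordNetSense:
    """One WordNet word sense; names follow the WordNet Prolog s/6 fields."""
    synset_id: int
    w_num: int
    word: str
    ss_type: str
    sense_number: int
    tag_count: int
    gloss: Tuple[str, ...]  # tokenised gloss; truncated on the slide


@dataclass(frozen=True)
class SenseLink:
    """A KIMD sense paired with a WordNet sense and its Malay equivalents."""
    kimd: KimdSense
    wordnet: WordNetSense
    malay: Tuple[str, ...]


def index_by_synset(links: List[SenseLink]) -> Dict[int, List[SenseLink]]:
    """Group links by WordNet synset id so Malay equivalents can be looked up."""
    index: Dict[int, List[SenseLink]] = {}
    for link in links:
        index.setdefault(link.wordnet.synset_id, []).append(link)
    return index


# The 'abacus' example from the slide, using only the tokens shown there.
abacus = SenseLink(
    kimd=KimdSense(
        "abacus", "n", 1,
        ("device", "for", "calculating", ",", "a", "square", "or",
         "rectangular", "frame"),
    ),
    wordnet=WordNetSense(
        102155519, 1, "abacus", "n", 2, 0,
        ("performs", "arithmetic", "functions", "by"),
    ),
    malay=("sepua", "sempoa", "dekak-dekak"),
)

by_synset = index_by_synset([abacus])
malay_equivalents = by_synset[102155519][0].malay
```

Keying the index on the synset id is one possible design choice; it lets any WordNet-based tool retrieve the Malay equivalents without knowing the KIMD headword.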
KD Sense Processing (A->Z)
Source is KAMUS DEWAN (KD)
Steps of process:
• Extract word senses (ws) from KD (result: approx. 30K ws with definition)
• Extract primitive words (ps) from KD based on frequency (result: approx. 5K ps with definition)
• Extract synonyms from KD (result: approx. 6K synonyms)
• Use KD sense numbering to tag synonyms. Example of result:
syn_kd(adem1, sejuk1)
syn_kd(adem3, tenang2)
LANGUAGE DATA (cont.)
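To illustrate how the sense-tagged synonym facts might be consumed, here is a minimal sketch that parses facts in the syn_kd(word+sense, word+sense) form shown above and builds a lookup from each sense to its synonyms. The function names, the regular expression and the decision to treat synonymy as symmetric are assumptions made for this example; the slides do not describe how the facts are stored or queried. The pattern also assumes that headwords contain no digits, so the trailing number can be read as the KD sense number.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sense = Tuple[str, int]  # (headword, KD sense number)

# Matches e.g. "syn_kd(adem1, sejuk1)"; assumes headwords contain no digits.
FACT_RE = re.compile(
    r"syn_kd\(\s*([^\d\s,()]+)(\d+)\s*,\s*([^\d\s,()]+)(\d+)\s*\)"
)


def parse_syn_kd(text: str) -> List[Tuple[Sense, Sense]]:
    """Extract (sense, sense) pairs from a string of syn_kd facts."""
    pairs = []
    for w1, n1, w2, n2 in FACT_RE.findall(text):
        pairs.append(((w1, int(n1)), (w2, int(n2))))
    return pairs


def build_index(pairs: List[Tuple[Sense, Sense]]) -> Dict[Sense, Set[Sense]]:
    """Map each sense to its synonyms (synonymy assumed symmetric here)."""
    index: Dict[Sense, Set[Sense]] = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


facts = "syn_kd(adem1, sejuk1) syn_kd(adem3, tenang2)"
pairs = parse_syn_kd(facts)
index = build_index(pairs)
```

With this index, looking up ("adem", 3) yields the sense ("tenang", 2) rather than every synonym of "adem", which is the point of tagging synonyms with sense numbers.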
OTHER POSSIBLE SOURCES OF DATA
DEWAN BAHASA DAN PUSTAKA (LANGUAGE ACADEMY)
•Copies of all the data types held at UTMK (perhaps in greater volume)
•Corpus: more recent publications (books, novels, journals, etc.)
NEWSPAPERS
•Corpus: more recent years, i.e. since they began publishing on the Internet (THE STAR, NEW STRAITS TIMES, etc.)
OTHER R&D CENTRES
•UNIV. TEKNOLOGI MALAYSIA (UTM)
•INSTITUT TERJEMAHAN NEGARA (ITNM)
•UNIV. KEBANGSAAN MALAYSIA (UKM)
•UNIVERSITI MALAYA (UM)
•UNIV. TEKNOLOGI MARA (UiTM)
THANK YOU
ARIGATO
MERCI
SHUKRIYA
GRAZIE
XIE-XIE NI
TERIMA KASIH