LANGUAGE RESOURCES IN MALAYSIA Zaharin Yusoff Computer-Aided Translation Unit School of Computer Sciences Universiti Sains Malaysia 11800 Penang, Malaysia Symposium on Language Resources in Asia

Page 1

LANGUAGE RESOURCES IN MALAYSIA

Zaharin Yusoff
Computer-Aided Translation Unit
School of Computer Sciences
Universiti Sains Malaysia
11800 Penang, Malaysia

Symposium on Language Resources in Asia

Page 2

HISTORICAL PERSPECTIVE

[Timeline figure; axis marks: 1977, 1980, 1990, 2000. Labels, in extraction order: USM, GETA, UTMK, MT, MT, MAHT, CL TOOLS, NLP APPLICATIONS, UTM, CICC, MT, ITNM, TRANSLATION, UKM, NLP, UM, UiTM, MT, CALL]

UNIVERSITI SAINS MALAYSIA (USM)
Unit Terjemahan Melalui Komputer (UTMK)

UNIV. TEKNOLOGI MALAYSIA (UTM)

INSTITUT TERJEMAHAN NEGARA (ITNM)

UNIV. KEBANGSAAN MALAYSIA (UKM)

UNIVERSITI MALAYA (UM)

UNIV. Institut TEKNOLOGI MARA (UiTM)

DEWAN BAHASA DAN PUSTAKA (DBP)

Page 3

MAIN POINTS

LANGUAGE RESOURCES

COMP. LING. TOOLS

GENERIC TOOLS
•NOT TOO MANY
•MOSTLY NOT UPDATED
•SOME ARE REUSABLE

APPLICATION BASED
•THE MORE RECENT ONES
•DEPENDENT ON DEMAND
•BUT MODULAR & REPROGRAMMABLE

LINGUISTIC DATA

LINGWARE DATA
•VERY LITTLE
•NOT REUSABLE
•METHODOLOGIES OK

LANGUAGE DATA
•REASONABLE
•SOME INCOMPLETE
•DIFFICULT TO ACQUIRE
•BUT REUSABLE

RECALL:
•Too Few Researchers (60 at peak in 1991, now 15)
•Lacking in Formal Linguistic Studies for Malay
•Lack of Culture of Data Accumulation

Page 4

LINGUISTIC RESOURCES

GENERIC TOOLS

•MT software: JEMAH

•Automatic Generator of Lingware
-Analysis
-Synthesis

•User-Driven MT System

•Language Tools:
-Spellchecker
-Desktop Accessories (Dicts)
-Text Analysis
-etc.

•Linguistic Tools:
-Corpus System
-Dictionary System
-Grammar Editor (STCG)
-Bilingual Corpus Bank
-etc.

APPLICATION BASED TOOLS

•MAHT system: SISKEP

•Example Based MT

•EDI (parsing/generation msg. types)

•Semantic Driven Search Engine

•WEB Crawler

•Internet Portal (??)

•NOT TOO MANY
•MOSTLY NOT UPDATED
•SOME ARE REUSABLE

•THE MORE RECENT ONES
•DEPENDENT ON DEMAND
•BUT MODULAR & REPROGRAMMABLE

LINGWARE DATA

•Ariane/Jemah MT English->Malay (all phases)

•STCG Malay Grammar

•VERY LITTLE
•NOT REUSABLE
•METHODOLOGIES OK

Page 5

LANGUAGE DATA

DICTIONARIES (WINHELP)

•ENGLISH-MALAY DBP (KIMD): 10.16 MB, 1945 pages

•MALAY DBP (KD): 6.63 MB, 1566 pages

•TERMINOLOGIES (MABBIM): 8.13 MB, 1069 pages

•COMPUTER (Malay): 1.15 MB …

•FRENCH-ENGLISH-MALAY: 3.57 MB …

DICTIONARIES (Databases: attribute format)

•KIMD (as above): data missing for letters B, O, R, S, T, U, V, X, Y, Z

•KD (as above): letter A only (1,544 words)

•MALAY THESAURUS

CORPUS

•Malay Books, Letters to Editor (System): 2.2 million words

•Translations (Malay only, in MS Word): 23 titles (average 1.5 MB, 350 pages)

•English-Malay (Parallel Text): 3 titles (1 with sentence alignment)

•REASONABLE
•SOME INCOMPLETE
•DIFFICULT TO ACQUIRE
•BUT REUSABLE

Page 6

LANGUAGE DATA (cont.)

KIMD-WordNet Link (A->F only)

The sources are KIMD and WordNet; entries are linked by sense entry in WordNet and KIMD, e.g.

abacus

KIMD(abacus,n,1 [device, for, calculating, ',', a, square, or, rectangular, frame, ….]).
*** (entry and definition taken from KIMD; some redefined to fit)

WORDNET(102155519, 1, 'abacus', n, 2, 0, [performs, arithmetic, functions, by, ….]).
*** (entry and definition taken from WordNet)

=== sepua, sempoa, dekak-dekak
*** (Malay equivalent taken from KIMD)
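The link record above can be pictured as a small data structure. The following is only a minimal sketch, not the UTMK implementation: the class name `SenseLink`, its field names, and the helper functions are hypothetical, and it assumes that a KIMD sense is identified by (headword, part of speech, sense number) and a WordNet sense by the first two fields of the WORDNET fact (taken here to be a synset identifier and a word number). Only the values shown on the slide are used; the truncated definitions are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical key types: these names are assumptions, not part of the source.
KimdKey = Tuple[str, str, int]   # (headword, part of speech, KIMD sense number)
WordNetKey = Tuple[int, int]     # (first field, second field of the WORDNET fact)


@dataclass(frozen=True)
class SenseLink:
    """One KIMD-WordNet link together with its Malay equivalents."""
    kimd: KimdKey
    wordnet: WordNetKey
    malay: Tuple[str, ...]


# The single example from the slide, copied as given.
LINKS: List[SenseLink] = [
    SenseLink(
        kimd=("abacus", "n", 1),
        wordnet=(102155519, 1),
        malay=("sepua", "sempoa", "dekak-dekak"),
    ),
]


def index_by_wordnet(links: List[SenseLink]) -> Dict[WordNetKey, SenseLink]:
    """Build a lookup table from WordNet key to link record."""
    return {link.wordnet: link for link in links}


def malay_for(index: Dict[WordNetKey, SenseLink],
              synset_id: int, w_num: int = 1) -> List[str]:
    """Return the Malay equivalents linked to a WordNet sense, or [] if none."""
    link = index.get((synset_id, w_num))
    return list(link.malay) if link is not None else []


if __name__ == "__main__":
    idx = index_by_wordnet(LINKS)
    print(malay_for(idx, 102155519))
```

Keying the index on the WordNet side makes it easy to go from an English sense to its Malay equivalents; a second index keyed on `kimd` would serve the opposite direction.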

KD Sense Processing (A->Z)

Source is KAMUS DEWAN (KD)

Steps of process:

• Extract word senses (ws) from KD (result: approx. 30K ws with definition)

• Extract primitive words (ps) from KD based on frequency (result: approx. 5K ps with definition)

• Extract synonyms from KD (result: approx. 6K synonyms)

• Use KD sense numbering to tag synonyms. Example of result:
syn_kd(adem1, sejuk1)
syn_kd(adem3, tenang2)
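The sense-tagged synonym pairs produced by the last step can be sketched as a lookup table. This is an illustration only: the function names `split_tag`, `build_index`, and `synonyms_of` are hypothetical, and the sketch assumes that each tag is a word followed directly by its KD sense number and that the synonym relation can be read in both directions, neither of which the slide states explicitly.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sense = Tuple[str, int]  # (word, KD sense number)

# A tag such as "adem1" is assumed to be a word followed by a sense number.
_TAGGED = re.compile(r"^(?P<word>.*\D)(?P<sense>\d+)$")


def split_tag(tagged: str) -> Sense:
    """Split a sense-tagged word such as 'adem1' into ('adem', 1)."""
    match = _TAGGED.match(tagged)
    if match is None:
        raise ValueError(f"not a sense-tagged word: {tagged!r}")
    return match.group("word"), int(match.group("sense"))


def build_index(pairs: Iterable[Tuple[str, str]]) -> Dict[Sense, List[Sense]]:
    """Index syn_kd pairs by sense, treating each pair as symmetric (assumption)."""
    index: Dict[Sense, List[Sense]] = defaultdict(list)
    for left, right in pairs:
        a, b = split_tag(left), split_tag(right)
        index[a].append(b)
        index[b].append(a)
    return dict(index)


def synonyms_of(index: Dict[Sense, List[Sense]], word: str, sense: int) -> List[Sense]:
    """Return the sense-tagged synonyms recorded for one sense of a word."""
    return index.get((word, sense), [])


# The two example facts from the slide.
SYN_KD = [("adem1", "sejuk1"), ("adem3", "tenang2")]

if __name__ == "__main__":
    idx = build_index(SYN_KD)
    print(synonyms_of(idx, "adem", 1))
    print(synonyms_of(idx, "adem", 3))
```

Keeping the sense number in the key is what separates the two readings of "adem" in the example: sense 1 pairs with "sejuk" and sense 3 with "tenang".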

Page 7

LANGUAGE DATA (cont.)

OTHER POSSIBLE SOURCES OF DATA

DEWAN BAHASA DAN PUSTAKA (LANGUAGE ACADEMY)

•Copies of all the data types held at UTMK (perhaps in greater volume)

•Corpus: more recent publications (books, novels, journals, etc.)

NEWSPAPERS

•Corpus: more recent years, i.e. since publishing on the Internet (STAR, NEW STRAITS TIMES, etc.)

OTHER R&D CENTRES

•UNIV. TEKNOLOGI MALAYSIA (UTM)

•INSTITUT TERJEMAHAN NEGARA (ITNM)

•UNIV. KEBANGSAAN MALAYSIA (UKM)

•UNIVERSITI MALAYA (UM)

•UNIV. Institut TEKNOLOGI MARA (UiTM)

Page 8

THANK YOU

ARIGATO

MERCI

SHUKRIYA

GRAZIE

XIE-XIE NI

TERIMA KASIH