Language Resources in Indonesia
Language Technology & Applied Information Laboratory
Directorate for Information Technology and Electronics
Agency for the Assessment & Application of Technology (BPPT)
Indonesia
TBIT Laboratory - BPPT
- Apply, assess and develop Language Technology & Applied Information Technology in support of the Government's program for the development of IT & Electronics in Indonesia
- Advise on and set up national government policy for developing language technology and information technology
- Develop and deploy language technologies in the areas of language processing, text analysis and generation, information retrieval and extraction, and machine translation
- Develop and maintain Language Resources, i.e. grammar rules, electronic dictionaries and annotated corpora
- Develop an Electronic Data Interchange (EDI) and Electronic Commerce suite for SMEs
Project Portfolio
- Multilingual Machine Translation System (CICC-MMTS)
- KEBI (Indonesian Electronic Dictionaries)
- UNL (Universal Networking Language)
- INCI (Indonesian National Corpus Initiative)
- Online I-E Dictionary on news portal (Detik.com)
- Multimedia Dictionary (including speech synthesizer)
- Yanetra (NLP tools for the blind)
- Others:
  - Manufacturing Technology supported by advanced and integrated information system through International Cooperation (MATIC) for Automotive, Apparel, and Electronics
  - Web Information Gateway for Apparel Electronic Commerce Projects
Indonesian Electronic Dictionaries - KEBI
- Word dictionary (50K root words, ~250K derivational words)
- Concept dictionary
- Co-occurrence dictionary
- Terminology dictionary (15K terms)
Indonesian-English Online Dictionary
- Indonesian-English Online Dictionary on the Detik.com portal (number 1 for online breaking news)
[Diagram. Components: News Article, Content Management System, Dynamic HTML Generator, Online Dictionary, English Summarization, Mini Web Pages with English word links, User]
Indonesian National Corpus Initiative (INCI/KNBI)
- Sourced from the national news agency LKBN ANTARA
- 50,000 sentences
- ~1 million words
- Ambiguous word types
- Ambiguous word tokens
- POS and phrase attachment ambiguity
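The distinction between ambiguous word types and ambiguous word tokens can be made concrete with a small calculation. The sketch below uses the common corpus-linguistics definitions (a type is ambiguous if it occurs with more than one tag; a token is ambiguous if its type is), which the slides do not spell out. The function name `ambiguity_rates`, the toy sentence, and the tag labels are illustrative assumptions, not taken from INCI.

```python
from collections import defaultdict

def ambiguity_rates(tagged_tokens):
    """Return (type_rate, token_rate) for a list of (word, tag) pairs.

    type_rate  = share of distinct word forms seen with more than one tag
    token_rate = share of running tokens whose word form is ambiguous
    """
    if not tagged_tokens:
        return 0.0, 0.0
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    hits = sum(1 for word, _ in tagged_tokens if word.lower() in ambiguous)
    token_rate = hits / len(tagged_tokens)
    return type_rate, token_rate

# Hypothetical toy data: "bisa" can mean "venom" (noun) or "can" (modal).
# The tag names are placeholders, not the INCI tagset.
toy = [
    ("bisa", "NOUN"), ("ular", "NOUN"), ("itu", "DET"),
    ("bisa", "MODAL"), ("membunuh", "VERB"),
]
type_rate, token_rate = ambiguity_rates(toy)
print(type_rate, token_rate)
```

Token ambiguity is usually the higher of the two figures, because ambiguous forms tend to be frequent words.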
[NP <JAKSA:IDNCC$IN11135> <AGUNG:IDAJGP$IAJVA> NP] [VP <BERIKAN:IDVT/IDVBT$IVPBN> [NP <CERAMAH:IDNCA$INCAEV> NP] [PP <DI:IDPP$IPPLA> [NP <DEPARTEMEN:IDNCA$INCAOR> <KEUANGAN:IDNCA$INCACT> NP] VP]
[NP <Jaksa:IDNCC$IN11135> <Agung:IDAJGP$IAJVA> <Sukarton:IDNM$null> <Marmosudjono:IDNM$null> <SH:IDNM$INMTL> NP] [ADP <hari:IDNCA$INCATM> <Jumat:IDNM$INMDY> ADP] [PP <di:IDPP$IPPLA> <hadapan:IDNCA$INCALC> [NP <Menteri:IDNCC$IN11135> <Keuangan:IDNCA$INCACT> <Menteri:IDNCC$IN11135> <Muda:IDAJGP$IAJST> <Keuangan:IDNCA$INCACT> <Menteri:IDNCC$IN11135> <Perdagangan:IDNCA$INCACT> <dan:IDCJCO$ICJCOAD> <para:IDPP$IPPACC> <pejabat:IDNCA$INCACT> <Eselon-I:IDNM$null> <lingkup:IDNCA$INACC> <Departemen:IDNCA$INCAOR> <Keuangan:IDNCA$INCACT> NP] PP] [VP <mengadakan:IDVT/IDVBT$IVABS> [NP <pemaparan:IDNCA$INCAAC> <tentang:IDAJGP$IAJGT> <kejahatan:IDNCA$INCACT> <korupsi:IDNCA$INCASD> <dan:IDCJCO$ICJCOAD> <penyelundupan:IDNCA$INCAAC> NP] VP]
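The annotated sample above appears to encode each token as `<WORD:TAG$CODE>` inside phrase brackets such as `[NP ... NP]`. A minimal sketch for pulling the tokens out of such a string follows. Reading the first field as the word form, the second as a POS tag, and the third as a secondary (possibly semantic) code is an assumption based only on the visible pattern; the slides do not document the format. The name `parse_tokens` is hypothetical.

```python
import re

# One token looks like <WORD:TAG$CODE>; the code may be the literal "null".
# Field meanings are assumed from the sample, not from INCI documentation.
TOKEN = re.compile(r"<([^:<>]+):([^$<>]+)\$([^<>]+)>")

def parse_tokens(annotated):
    """Return (word, tag, code) triples; a "null" code becomes None."""
    return [
        (word, tag, None if code == "null" else code)
        for word, tag, code in TOKEN.findall(annotated)
    ]

sample = (
    "[NP <JAKSA:IDNCC$IN11135> <AGUNG:IDAJGP$IAJVA> NP] "
    "[VP<BERIKAN:IDVT/IDVBT$IVPBN> [NP <CERAMAH:IDNCA$INCAEV> NP] "
    "[NP <Sukarton:IDNM$null><Eselon-I:IDNM$null> NP]"
)

for word, tag, code in parse_tokens(sample):
    print(word, tag, code)
```

The pattern ignores the phrase brackets entirely, so it works whether or not the bracketing in a given line is balanced.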
BIAS (Bahasa Indonesia Analysis System)
- Part of CICC-MMTS
- Improvement using a stochastic-symbolic approach
- Supervised and unsupervised learning
- 15,000 sentences of annotated corpus (based on GDA tagset)
- ISTAG (POS Tagger)
- ISPARSE (Skeleton Parser)
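As an illustration of what supervised learning from an annotated corpus means for POS tagging, the sketch below trains a most-frequent-tag baseline. This is a generic textbook baseline, not a description of ISTAG, whose actual algorithm the slides do not give. The function names, the toy training sentences, and the tag labels are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Learn each word's most frequent tag, plus a fallback tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    # Unknown words get the most frequent tag overall.
    default_tag = all_tags.most_common(1)[0][0]
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, default_tag

def tag_words(words, model, default_tag):
    """Tag each word with its learned tag, or the fallback if unseen."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Hypothetical training data; tags are placeholders, not the GDA tagset.
train = [
    [("jaksa", "NOUN"), ("agung", "ADJ"), ("berikan", "VERB"), ("ceramah", "NOUN")],
    [("menteri", "NOUN"), ("keuangan", "NOUN")],
]
model, default_tag = train_unigram_tagger(train)
print(tag_words(["Jaksa", "berikan", "pemaparan"], model, default_tag))
```

A baseline like this ignores context, which is the gap that stochastic sequence models and symbolic rules are typically combined to close.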
UNL Project
Universal Networking Language (UNL).
- Deconverter & Enconverter System
- UNL graph displayer System
- UNL editor System
- Indonesia Language Server: http://unlserver.aia.bppt.go.id