Current status of Vietnamese Treebank and usefulness of collaboration with
Asian Language Treebank
VU TAT THANG
Dept. of Multimedia Human-Machine Language Technology,
Institute of Information Technology,
Vietnam Academy of Science and Technology.
Content
IOIT and International Collaborations
Vietnamese Language
VLSP Standard
Current status of Vietnamese Processing
Propose idea
26/11/2015
IOIT – a member of ASEAN MT
Member of the “Network-based ASEAN Languages Translation Public Service Project”, 2012-2015, led by NECTEC, Thailand
Communication among people in the ASEAN region has increased gradually and is expected to intensify after 2015, when the ASEAN Community begins. Automatic machine translation (MT) has become increasingly important for facilitating cross-language communication, but it remains limited for ASEAN countries.
Sharing language data
Develop platform
Integration of translation system
IOIT – a member of A-STAR (U-STAR)
A-STAR (Asian Speech Translation Advanced Research), 2008-2010
U-STAR (Universal Speech Translation Advanced Research), 2010 to present
Mechanism of S2S system
[Figure: speech-to-speech translation pipeline between Vietnamese and English]
Components: Multilingual Speech Recognition; Spoken Language Translation; Multilingual Speech Synthesis
Methods: Statistical Speech Recognition; Statistical Machine Translation + Multi-engine approach; Corpus-based Speech Synthesis
Resources: Large Scale Vietnamese Speech Corpora; Large Scale English Speech Corpora; Large Scale Parallel Corpora of Vietnamese and English; Large Scale Vietnamese Text Corpora; Large Scale English Text Corpora
Example: "I go to school" and "Tôi đi đến trường"
Vietnamese Language
Spoken as mother tongue by 86% of Vietnam’s population
~ 3 million overseas Vietnamese, most of whom live in the US
Part of the Austroasiatic language family (168 languages)
Much of the vocabulary has been borrowed from Chinese
Writing system:
Formerly: Chinese writing system
Today: Latin alphabet, with additional diacritics for tones and certain letters
Dialects: Northern, Central, Southern
Vietnamese language
The Vietnamese language was established a long time ago
Chinese characters were used for a long time
A unique Vietnamese writing system called Chu Nom (字喃) appeared in the 10th century
Romanized script (Quốc Ngữ) in use since the beginning of the 20th century
Example:
Nam quốc sơn hà Nam đế cư
南 国 山 河 南 帝 居
Over Mountains and Rivers of the South, Reigns the Emperor of the South
Setting up the VLSP “standards” for the public
Importance of “standards” in VLSP: choose a unified view from the various schools of thought on the Vietnamese language
Guide for word recognition and description: morphological, syntactic, semantic criteria
Guide for constituent labeling: noun phrase, verb phrase, clause, etc.
Guide for sentence splitting
Others
VLSP national project
National project with eleven active research groups on VLSP (Vietnamese Language and Speech Processing)
Building VLSP infrastructure, especially indispensable resources and tools for VLSP development.
Building and developing several typical VLSP products for public end-users.
[Figure: project areas: natural language processing methods; pragmatics: speech, text and Web data mining; tools, corpora, resources]
ML and statistical methods in NLP
[Figure: NLP areas marked “some ML/Stat” or “no ML/Stat”; pages 11-12 from Marie Claire, ECML/PKDD 2005]
Word Segmentation
Consider the words "nhà cửa", "sắc đẹp", "hiệu sách".
They are words in the following sentences:
a. Nhà cửa bề bộn quá.
b. Cô ấy giữ gìn sắc đẹp.
c. Ngoài hiệu sách có bán cuốn này.
They are not words in:
a. Ở nhà cửa ngõ chẳng đóng gì cả.
b. Bức này màu sắc đẹp hơn.
c. Ngoài cửa hiệu sách báo bày la liệt.
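The ambiguity in these examples can be made concrete with a minimal sketch of greedy longest-match segmentation. This is an illustration only, not the VLSP segmenter: the function name `longest_match` and the toy lexicon are assumptions introduced here. The sketch segments the first sentence as intended but wrongly groups "nhà cửa" in the second, which is why a dictionary alone is not enough.

```python
# Toy lexicon (assumption): a few multi-syllable words from the examples above.
LEXICON = {
    "nhà cửa", "sắc đẹp", "hiệu sách", "cửa ngõ", "màu sắc",
    "cửa hiệu", "sách báo", "bề bộn", "giữ gìn", "la liệt",
}


def longest_match(sentence, lexicon=LEXICON, max_len=3):
    """Greedy left-to-right segmentation: at each position take the longest
    run of syllables found in the lexicon, else a single syllable."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# "nhà cửa" is correctly a word here.
print(longest_match("Nhà cửa bề bộn quá"))
# Here the greedy match also yields "nhà cửa", although the intended
# reading is "nhà" + "cửa ngõ".
print(longest_match("Ở nhà cửa ngõ chẳng đóng gì cả"))
```

Resolving such cases needs context beyond the lexicon, for example n-gram statistics over candidate segmentations.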
Many tools, such as ChaSen and YamCha, exist to do such a simple task, e.g. on the Japanese sentence このひとことで元気になった (“This one remark cheered me up”)
Example: Guideline for POS tagging
36 word labels in English, from Penn Treebank (1989)
30 word labels in Chinese, from Chinese TreeBank (1998)
47 word labels in Thai, from Orchid corpus (1997)
How many for Vietnamese?
Project target products
SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis
SP2: Speech recognition system with large vocabulary
SP3: English-Vietnamese translation system
SP4: IREST: Internet use support system
SP5: Vietnamese spelling checker
SP6.1: Corpora for speech recognition
SP6.2: Corpora for speech synthesis
SP6.3: Corpora for specific words
SP7.1: English-Vietnamese dictionary
SP7.2: Viet dictionary
SP7.3: Vietnamese treebank
SP7.4: E-V corpora of aligned sentences
SP8.1: Speech analysis tools
SP8.2: Vietnamese word segmentation
SP8.3: Vietnamese POS tagger
SP8.4: Vietnamese chunker
SP8.5: Vietnamese syntax analyser
To be the standard for long-term development
NLP tools + resources
All the tools: Word segmentation, POS tagging, Chunking, Syntax analysis are constructed based on the same view of words, label assignment, sentences, Viet dictionary and Viet Treebank.
Using statistical and machine learning methods in building such tools.
All the tools and resources are given to the R&D community.
Vietnamese WordNet 2012-2015
Developing Vietnamese WordNet with the following features:
Vietnamese WordNet with 50,000 words (30,000 popular words and 20,000 domain-based)
30,000 synsets
Accuracy: 95% for terms in the same synset, 90% for relationships between different synsets
Develop API for WordNet users
Develop a tool to access, verify and update
Propose guideline for long-term WordNet development
NLP Resources
VietTreebank: 10,000 trees; 1,000,000 morphemes
Tools: graphical text editing, log and history view, agreement check, search by words and syntactic patterns
Vietnamese Machine Readable Dictionary
Model of VCL (Vietnamese Computational Lexicon), learned from other languages’ MRDs, with morphological, syntactic and semantic information.
35,000 commonly used words in modern Vietnamese
Develop a tool for building VCL with XML representation
[Figure: parse tree of the sentence "Ông già đi nhanh quá", with node labels S, NP, VP, P, V, T]
SP7.3: Viet Treebank
A Treebank or parsed corpus is a text corpus in
which each sentence has been parsed, i.e.
annotated with syntactic structure.
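As a rough illustration of what "annotated with syntactic structure" means in practice, the sketch below reads a bracketed tree string (in the style used by Penn-type treebanks) into nested tuples and recovers the words. The bracketing and its labels are a hypothetical example made up here, not the actual Viet Treebank annotation, and the function names are assumptions.

```python
import re


def parse_bracketed(text):
    """Parse a bracketed tree such as '(S (NP a) (VP b))' into
    nested (label, children) tuples; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical bracketing, for illustration only.
example = "(S (NP Ông già) (VP (V đi) (A nhanh) (R quá)))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

A treebank is then simply a large collection of such trees, one per sentence, from which taggers, chunkers and parsers can be trained.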
English: Penn Treebank (4.5M words) and many others
Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
Korean: Korean Treebank (5078 trees, 54K words)
Viet Treebank (2012): 10,000 trees, 1,000,000 morphemes
[Figure: stack of components, top to bottom: Viet machine translation, info extraction, etc.; Viet Treebank; Viet syntactic parser; Viet chunker; Viet POS tagger; Viet word segmenter]
Study various existing treebanks, modern theories of syntax, and the Vietnamese language
Build guidelines for word segmentation, POS, and syntax
“Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”
(“the house is in a jumble” and “at home the door is not closed”)
“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”
(“she keeps her beauty” and “this painting has better color”)
Build the tools
Labeling, with 95% agreement between labelers
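The slides do not say how labeler agreement was measured. One simple possibility, shown here only as a sketch under that assumption, is raw percent agreement: the share of items to which two labelers assigned the same label. The function name and the toy tag sequences are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where two labelers chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelers must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical tags for four tokens from two labelers.
labeler_1 = ["N", "V", "A", "R"]
labeler_2 = ["N", "V", "A", "N"]
print(percent_agreement(labeler_1, labeler_2))  # 3 of 4 labels match
```

Raw agreement does not correct for chance; chance-corrected measures such as Cohen's kappa are a common alternative.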
SP7.3: Viet Treebank
NLP Tools
Word segmentation
Methods: n-gram + dictionary + regular expression
97.1% on VieTreebank with 220,000 annotated Vietnamese words
98.2% on 100 sentences not included in VieTreebank
POS tagger
Methods: MEMs, CRFs
Training: 20,000 sentences with POS from VieTreebank and VN dictionary
Result: 90%
Syntactic parser 1
Method: HPSG grammar
P = 82%, R = 74%, F-score = 78%, tested on 100 sentences in VieTreebank
Syntactic parser 2
Method: LPCFG, Bikel’s implementation
F-score = 78%, tested on 9,600 sentences in VieTreebank
Chunker
Method: CRF, online learning on > 9,000 sentences with POS as in VieTreebank
Result: 94%
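The F-score reported for syntactic parser 1 is consistent with the standard balanced F-score, the harmonic mean of precision and recall. The short check below assumes that definition (the slides do not state which variant was used); the function name is introduced here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P = 82%, R = 74% as reported for the HPSG-based parser.
print(round(f_score(0.82, 0.74), 2))  # 0.78
```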
We need Asian Language Treebank
ALT is a key resource for most Asian languages.
It can be constructed from multilingual corpora across all Asian languages with
The same standard of infrastructure
The same kind of tool
…
It accelerates NLP research for Asian languages
We have treebanks for English, Japanese, Vietnamese
What about Indonesian, Thai, Khmer, Lao, Malay, Myanmar, Filipino, ...?
[Figure: tools (word segmenter, POS tagger, chunker, syntactic parser, …) and applications (search engines, information retrieval, machine translation, QA systems, …)]