
Current status of Vietnamese Treebank and usefulness of collaboration with Asian Language Treebank

VU TAT THANG

Dept. of Multimedia Human-Machine Language Technology,

Institute of Information Technology,

Vietnam Academy of Science and Technology.

Content

IOIT and International Collaborations

Vietnamese Language

VLSP Standard

Current status of Vietnamese Processing

Proposed idea

26/11/2015

IOIT – a member of ASEAN MT

Member of the “Network-based ASEAN Languages Translation Public Service Project”, 2012-2015, led by NECTEC (Thailand)

Communication among people in the ASEAN region has increased gradually and is expected to grow sharply after 2015, when the ASEAN Community begins. Automatic machine translation (MT) has become more and more important for cross-language communication, but remains limited for ASEAN countries.

Sharing language data

Develop platform

Integration of translation system

IOIT – a member of A-STAR (U-STAR)

A-STAR (Asian Speech Translation Advanced Research), 2008-2010

U-STAR (Universal Speech Translation Advanced Research), 2010 to present


Mechanism of S2S system

[Figure: speech-to-speech (S2S) translation between Vietnamese and English. Processing stages: multilingual speech recognition (statistical speech recognition), spoken language translation (statistical machine translation + multi-engine approach), multilingual speech synthesis (corpus-based speech synthesis). Resources: large-scale Vietnamese and English speech corpora, large-scale parallel corpora of Vietnamese and English, large-scale Vietnamese and English text corpora. Example: "I go to school" and "Tôi đi đến trường".]


Vietnamese Language

Spoken as mother tongue by 86% of Vietnam’s population

~ 3 million overseas Vietnamese, most of whom live in the US

Part of the Austro-Asiatic language family (168 languages)

Much vocabulary has been borrowed from Chinese

Writing system: formerly the Chinese writing system; today the Latin alphabet, with additional diacritics for tones and certain letters

Dialects: Northern, Central, Southern


Vietnamese language

The Vietnamese language was established a long time ago

Chinese characters were used for a long time

Unique writing system of Vietnam, called Chữ Nôm (字喃), in the 10th century

Romanized script used to represent Quốc Ngữ since the beginning of the 20th century

Example: Nam quốc sơn hà Nam đế cư (南国山河南帝居), "Over Mountains and Rivers of the South, Reigns the Emperor of the South"


Setting up the VLSP “standards” for the public

Importance of “standards” in VLSP: choose a unified view from the various schools of thought on the Vietnamese language

Guide for word recognition and description: morphological, syntactic, semantic criteria

Guide for constituent labeling: noun phrase, verb phrase, clause, etc.

Guide for sentence splitting

Others


VLSP national project

National project with eleven active research groups on VLSP (Vietnamese Language and Speech Processing)

Building VLSP infrastructure, especially indispensable resources and tools for VLSP development.

Building and developing several typical VLSP products for public end-users.

ML and statistical methods in NLP

[Figure: natural language processing methods, each marked as "some ML/Stat" or "no ML/Stat". Recoverable labels: Pragmatics; Speech, text and Web data mining; Tools, corpora, resources. Source: pages 11-12 from Marie Claire, ECML/PKDD 2005.]

Word Segmentation

Consider the strings "nhà cửa", "sắc đẹp" and "hiệu sách". They are words in the following sentences:

a. Nhà cửa bề bộn quá.

b. Cô ấy giữ gìn sắc đẹp.

c. Ngoài hiệu sách có bán cuốn này.

And they are not words in:

a. Ở nhà cửa ngõ chẳng đóng gì cả.

b. Bức này màu sắc đẹp hơn.

c. Ngoài cửa hiệu sách báo bày la liệt.
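The ambiguity in these sentences can be made concrete with a small sketch. The Python snippet below implements greedy longest matching over syllables against a toy dictionary. The dictionary contents and the names `nfc` and `longest_match` are assumptions for illustration; this is not the VLSP segmenter.

```python
import unicodedata

def nfc(text):
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)

# Toy dictionary of two-syllable words (an illustrative assumption, not a real lexicon).
DICTIONARY = {nfc(w) for w in ("nhà cửa", "bề bộn", "cửa ngõ")}

def longest_match(sentence, dictionary=DICTIONARY, max_len=3):
    """Greedy left-to-right longest matching over whitespace-separated syllables."""
    syllables = nfc(sentence).split()
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in dictionary:
                words.append(candidate)
                i += n
                break
    return words

print(longest_match("Nhà cửa bề bộn quá"))
print(longest_match("Ở nhà cửa ngõ chẳng đóng gì cả"))
```

On the first sentence the sketch groups "Nhà cửa" as one word, as intended. On the second it also groups "nhà cửa", although the slide states it is not a word there, which illustrates why dictionary lookup alone does not settle segmentation.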


Many tools, such as ChaSen, YamCha, …, exist to do such a simple task.

Japanese example: このひとことで元気になった ("This one remark cheered me up")

Example: Guideline for POS tagging

36 word labels in English, from Penn Treebank (1989)

30 word labels in Chinese, from Chinese TreeBank (1998)

47 word labels in Thai, from Orchid corpus (1997)

How many for Vietnamese?

Project target products

SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis

SP2: Speech recognition system with large vocabulary

SP3: English-Vietnamese translation system

SP4: IREST: Internet use support system

SP5: Vietnamese spelling checker

SP6.1: Corpora for speech recognition

SP6.2: Corpora for speech synthesis

SP6.3: Corpora for specific words

SP7.1: English-Vietnamese dictionary

SP7.2: Viet dictionary

SP7.3: Vietnamese treebank

SP7.4: E-V corpora of aligned sentences

SP8.1: Speech analysis tools

SP8.2: Vietnamese word segmentation

SP8.3: Vietnamese POS tagger

SP8.4: Vietnamese chunker

SP8.5: Vietnamese syntax analyser

To be standard for long-term development


NLP tools + resources

All the tools (word segmentation, POS tagging, chunking, syntax analysis) are built on the same view of words, label assignment and sentences, and on the same Viet dictionary and Viet Treebank.

Statistical and machine learning methods are used in building these tools.

All the tools and resources are given to the R&D community.


Vietnamese WordNet 2012-2015

Developing Vietnamese WordNet with the following features:

50,000 words (30,000 popular words and 20,000 domain-based)

30,000 synsets

Accuracy: 95% for terms in the same synset, 90% for the relationships between different synsets

Develop an API for WordNet users

Develop a tool to access, verify and update

Propose a guideline for long-term WordNet development

NLP Resources

VietTreebank: 10,000 trees; 1,000,000 morphemes

Tools: graphical text editing, log and history view, agreement check, search by words, syntactic patterns

Vietnamese Machine Readable Dictionary

Model of VCL (Vietnamese Computational Lexicon), built by learning from other languages’ MRDs, with morphological, syntactic and semantic information.

35,000 commonly used words in modern Vietnamese

Develop a tool for building VCL with XML representation
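As a rough illustration of what an XML lexicon entry carrying morphological, syntactic and semantic information could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The slides do not give the VCL schema, so every element name, attribute name and field value below is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

def make_entry(headword, pos, subcat, gloss):
    """Build one lexicon entry; all element and attribute names are hypothetical."""
    entry = ET.Element("entry", {"headword": headword})
    # Morphological information: here just the syllable count of the headword.
    ET.SubElement(entry, "morphology").text = f"{len(headword.split())} syllable(s)"
    # Syntactic information: part of speech plus a subcategorization note.
    syntax = ET.SubElement(entry, "syntax", {"pos": pos})
    ET.SubElement(syntax, "subcat").text = subcat
    # Semantic information: a short English gloss.
    ET.SubElement(entry, "semantics").text = gloss
    return entry

entry = make_entry("sắc đẹp", pos="N", subcat="none", gloss="beauty")
print(ET.tostring(entry, encoding="unicode"))
```

Keeping the three kinds of information in separate child elements lets a building tool validate or update each layer independently.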


SP7.3: Viet Treebank

[Figure: parse tree of the sentence "Ông già đi nhanh quá", with node labels S, NP, VP, P, V, NP, T.]

A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.
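To make "annotated with syntactic structure" concrete, here is a minimal sketch that reads a Penn-style bracketed tree into nested Python lists and extracts its words. The bracketing of "Ông già đi nhanh quá", its labels (S, NP, VP, AP, N, V, A, R) and the underscore joining the syllables of one word are illustrative assumptions, not an actual VietTreebank entry or its tagset.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        node = [tokens[pos]]              # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # word (leaf)
                pos += 1
        pos += 1                          # consume ')'
        return node

    return read()

def leaves(node):
    """Return the words of a tree in left-to-right order."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

# Hypothetical bracketing, for illustration only.
tree = parse_tree("(S (NP (N Ông_già)) (VP (V đi) (AP (A nhanh) (R quá))))")
print(tree[0], leaves(tree))
```

Each nested list starts with its label, so the same structure serves both for searching syntactic patterns and for recovering the plain word sequence.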

English: Penn Treebank (4.5M words) and many others

Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)

Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks

Korean: Korean Treebank (5,078 trees, 54K words)

Viet Treebank (2012): 10,000 trees; 1,000,000 morphemes

[Figure: stack of layers, in the order shown on the slide: Viet machine translation, info extraction, etc.; Viet Treebank; Viet syntactic parser; Viet chunker; Viet POS tagger; Viet word segmenter.]

SP7.3: Viet Treebank

Study various existing treebanks, modern theories of syntax, and the Vietnamese language

Build guidelines for word segmentation, POS tagging, and syntax

“Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả” (“the house is in a jumble” and “at home the door is not closed”)

“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn” (“she keeps her beauty” and “this painting has better color”)

Build the tools

Labeling: agreement between labelers (95%)
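The 95% agreement figure can be read as the share of items on which two labelers chose the same label. A minimal sketch of that computation follows. The slides do not say how agreement was measured, so simple percent agreement, the function name `percent_agreement` and the sample tags are all assumptions.

```python
def percent_agreement(labels_a, labels_b):
    """Share of positions where two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# Hypothetical POS tags from two annotators for the same five tokens.
annotator_1 = ["N", "V", "A", "R", "N"]
annotator_2 = ["N", "V", "A", "R", "V"]
print(f"{percent_agreement(annotator_1, annotator_2):.0%}")
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected measure would be a different design choice from the one sketched here.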

NLP Tools

Word segmentation. Methods: n-gram + dictionary + regular expression

97.1% on VietTreebank, with 220,000 annotated Vietnamese words

98.2% on 100 sentences not included in VietTreebank

POS tagger. Methods: MEMs, CRFs

Training: 20,000 sentences with POS from VietTreebank and VN dictionary

90%

Syntactic parser 1. Method: HPSG grammar

P = 82%, R = 74%, F-score = 78%, tested on 100 sentences in VietTreebank

Syntactic parser 2. Method: LPCFG, Bikel’s implementation

F-score = 78%, tested on 9,600 sentences in VietTreebank

Chunker. CRF, online learning on > 9,000 sentences with POS as in VietTreebank

94%
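As a consistency check on the figures for syntactic parser 1, the F-score can be computed as the harmonic mean of precision and recall. The sketch below assumes the balanced F1 definition, since the slides do not state which variant was used. With P = 82% and R = 74% it reproduces the reported 78% after rounding.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f_score(0.82, 0.74)
print(round(f1 * 100))  # 78
```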


We need Asian Language Treebank

ALT is a key resource for most Asian languages.

It can be constructed from multilingual corpora across all Asian languages, with:

The same standard of infrastructure

The same kind of tools

…

It accelerates NLP research for Asian languages

We have treebanks for English, Japanese, Vietnamese

How about Indonesian, Thai, Khmer, Lao, Malay, Myanmar, Filipino, …?


[Figure: diagram with three groups of labels. Input: documents. Tools: word segmenter, POS tagger, chunker, syntactic parser, … Applications: search engines, information retrieval, machine translation, QA system, …]
