Current status of Vietnamese Treebank and usefulness of collaboration with Asian Language Treebank. VU TAT THANG, Dept. of Multimedia Human-Machine Language Technology, Institute of Information Technology, Vietnam Academy of Science and Technology.

  • Current status of Vietnamese Treebank and usefulness of collaboration with Asian Language Treebank

    VU TAT THANG

    Dept. of Multimedia Human-Machine Language Technology,

    Institute of Information Technology,

    Vietnam Academy of Science and Technology.

  • Content

    IOIT and International Collaborations

    Vietnamese Language

    VLSP Standard

    Current status of Vietnamese Processing

    Proposed idea

    26/11/2015

  • IOIT – a member of ASEAN MT

    Member of the “Network-based ASEAN Languages Translation Public Service Project”, 2012-2015, led by NECTEC, Thailand

    Communication among people in the ASEAN region has increased gradually and will grow sharply after 2015, when the ASEAN Community begins. Automatic machine translation (MT) has become more and more important for facilitating cross-language communication, but MT for the languages of ASEAN countries has been limited.

    Sharing language data

    Developing a platform

    Integrating translation systems

  • IOIT – a member of A-STAR (U-STAR)

    A-STAR (Asian Speech Translation Advanced Research), 2008-2010

    U-STAR (Universal Speech Translation Advanced Research), 2010 to present

  • Mechanism of the S2S (speech-to-speech) system

    [Diagram] Multilingual speech recognition (statistical speech recognition) → spoken language translation (statistical machine translation + multi-engine approach) → multilingual speech synthesis (corpus-based speech synthesis)

    Resources: large-scale Vietnamese and English speech corpora; large-scale Vietnamese and English text corpora; large-scale parallel corpora of Vietnamese and English

    Example: English “I go to school” ↔ Vietnamese “Tôi đi đến trường”

  • Vietnamese Language

    Spoken as a mother tongue by 86% of Vietnam’s population

    ~3 million overseas Vietnamese, most of whom live in the US

    Part of the Austro-Asiatic language family (168 languages)

    Much of the vocabulary has been borrowed from Chinese

    Writing system:

    Formerly: Chinese writing system

    Today: Latin alphabet, with additional diacritics for tones and certain letters

    Dialects: Northern, Central, Southern

  • Vietnamese language

    The Vietnamese language was established a long time ago

    Chinese characters were used for a long time

    Chữ Nôm (字喃), a writing system unique to Vietnam, appeared in the 10th century

    The romanized script, Quốc Ngữ, has been in use since the beginning of the 20th century

    Example:

    Nam quốc sơn hà Nam đế cư

    南 国 山 河 南 帝 居

    “Over Mountains and Rivers of the South, Reigns the Emperor of the South”

  • Setting up the VLSP “standards” for the public

    Importance of “standards” in VLSP: choosing a unified view from the various schools of thought on the Vietnamese language

    Guide for word recognition and description: morphological, syntactic, and semantic criteria

    Guide for constituent labeling: noun phrase, verb phrase, clause, etc.

    Guide for sentence splitting

    Others

  • VLSP national project

    National project with eleven active research groups on VLSP (Vietnamese Language and Speech Processing)

    Building the VLSP infrastructure, especially the indispensable resources and tools for VLSP development

    Building and developing several typical VLSP products for public end users

    [Diagram] Components: natural language processing methods; pragmatics: speech, text and Web data mining; tools, corpora, resources

  • ML and statistical methods in NLP

    [Figure: “some ML/Stat” vs. “no ML/Stat”. Source: pages 11-12 from Marie Claire, ECML/PKDD 2005]

  • Word Segmentation

    Consider the words "nhà cửa", "sắc đẹp", and "hiệu sách". They are words in the following sentences:

    a. Nhà cửa bề bộn quá.

    b. Cô ấy giữ gìn sắc đẹp.

    c. Ngoài hiệu sách có bán cuốn này.

    But the same syllable sequences are not words in:

    a. Ở nhà cửa ngõ chẳng đóng gì cả.

    b. Bức này màu sắc đẹp hơn.

    c. Ngoài cửa hiệu sách báo bày la liệt.

    Many tools, such as ChaSen, YamCha, …, exist for such a seemingly simple task (e.g. segmenting the Japanese sentence このひとことで元気になった)
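The ambiguity above can be made concrete with a minimal sketch of greedy longest-match segmentation. This is an illustration only, not the VLSP segmenter; the `LEXICON` contents and the function name `max_match` are assumptions made for the example.

```python
# Toy lexicon of two-syllable words (hypothetical, for illustration only).
LEXICON = {
    "nhà cửa", "cửa ngõ", "bề bộn", "sắc đẹp", "màu sắc",
    "hiệu sách", "cửa hiệu", "sách báo", "giữ gìn",
}


def max_match(syllables, lexicon, max_len=3):
    """Greedy left-to-right longest-match segmentation over syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then fall back to shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Here "nhà cửa" really is one word, and greedy matching finds it.
print(max_match("Nhà cửa bề bộn quá".split(), LEXICON))
# ['Nhà cửa', 'bề bộn', 'quá']

# Here the intended reading is "nhà" + "cửa ngõ", but greedy matching
# wrongly groups "nhà cửa" because it only looks at the lexicon.
print(max_match("Ở nhà cửa ngõ chẳng đóng gì cả".split(), LEXICON))
# ['Ở', 'nhà cửa', 'ngõ', 'chẳng', 'đóng', 'gì', 'cả']
```

The second output shows why a dictionary alone is not enough: choosing between "nhà cửa" and "cửa ngõ" requires context, which is what statistical methods add.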

  • Example: Guideline for POS tagging

    36 word labels in English, from the Penn Treebank (1989)

    30 word labels in Chinese, from the Chinese TreeBank (1998)

    47 word labels in Thai, from the Orchid corpus (1997)

    How many for Vietnamese?

  • Project target products

    To be standard for long-term development

    SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis

    SP2: Speech recognition system with large vocabulary

    SP3: English-Vietnamese translation system

    SP4: IREST: Internet use support system

    SP5: Vietnamese spelling checker

    SP6.1: Corpora for speech recognition

    SP6.2: Corpora for speech synthesis

    SP6.3: Corpora for specific words

    SP7.1: English-Vietnamese dictionary

    SP7.2: Viet dictionary

    SP7.3: Vietnamese treebank

    SP7.4: E-V corpora of aligned sentences

    SP8.1: Speech analysis tools

    SP8.2: Vietnamese word segmentation

    SP8.3: Vietnamese POS tagger

    SP8.4: Vietnamese chunker

    SP8.5: Vietnamese syntax analyser

  • NLP tools + resources

    All the tools (word segmentation, POS tagging, chunking, syntax analysis) are built on the same view of words, label assignment, and sentences, and on the same Viet dictionary and Viet Treebank.

    Statistical and machine learning methods are used to build these tools.

    All the tools and resources are made available to the R&D community.

  • Vietnamese WordNet 2012-2015

    Developing a Vietnamese WordNet with the following features:

    50,000 words (30,000 popular words and 20,000 domain-based words)

    30,000 synsets

    Accuracy: 95% for terms in the same synset, 90% for relationships between different synsets

    Develop an API for WordNet users

    Develop a tool to access, verify, and update the WordNet

    Propose a guideline for long-term WordNet development

  • NLP Resources

    VietTreebank: 10,000 trees; 1,000,000 morphemes

    Tools: graphical text editing, log and history view, agreement check, search by words and syntactic patterns

    Vietnamese Machine Readable Dictionary

    Model of the VCL (Vietnamese Computational Lexicon), built by learning from other languages’ MRDs, with morphological, syntactic, and semantic information

    35,000 commonly used words in modern Vietnamese

    A tool for building the VCL with XML representation

  • SP7.3: Viet Treebank

    A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.

    [Figure: parse tree of “Ông già đi nhanh quá”, with node labels S, NP, VP, P, V, T]

    English: Penn Treebank (4.5M words) and many others

    Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)

    Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks

    Korean: Korean Treebank (5,078 trees, 54K words)

    Viet Treebank (2012): 10,000 trees; 1,000,000 morphemes

    [Diagram] Viet machine translation, info extraction, etc.; Viet Treebank; Viet syntactic parser; Viet chunker; Viet POS tagger; Viet word segmenter
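Syntactic annotation of the kind a treebank holds is commonly stored as a bracketed string, as in the Penn Treebank. The sketch below parses such a string into nested tuples and recovers the sentence from the leaves. The bracketing and its labels (`S`, `NP`, `VP`, `V`, `ADV`) are a hypothetical illustration, not the Viet Treebank's actual annotation, and `parse_bracketed` and `leaves` are names made up for the example.

```python
def parse_bracketed(text):
    """Parse '(LABEL child ...)' into a (label, children) tuple."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf syllable
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the leaf syllables of a tree, left to right."""
    _, children = tree
    out = []
    for child in children:
        if isinstance(child, tuple):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out


# Hypothetical bracketing of the example sentence (labels are assumptions).
tree = parse_bracketed("(S (NP Ông già) (VP (V đi) (ADV nhanh quá)))")
print(tree[0])                  # S
print(" ".join(leaves(tree)))   # Ông già đi nhanh quá
```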

  • SP7.3: Viet Treebank

    Study various existing treebanks, modern theories of syntax, and the Vietnamese language

    Build guidelines for word segmentation, POS tagging, and syntax

    “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả” (“the house is in a jumble” and “at home the door is not closed”)

    “Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn” (“she keeps her beauty” and “this painting has better color”)

    Build the tools

    Labeling: agreement between labelers (95%)
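One simple way to quantify agreement between labelers is the fraction of tokens that two annotators label identically. The slides do not say how the 95% figure was computed, so the sketch below is only an assumption; the tag sequences are invented toy data and `percent_agreement` is a hypothetical helper name.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same tokens")
    if not labels_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy data: 20 tokens, the annotators disagree on exactly one of them.
annotator_a = ["N", "V", "A", "N", "P"] * 4
annotator_b = list(annotator_a)
annotator_b[7] = "N"   # annotator A chose "A" here

print(round(percent_agreement(annotator_a, annotator_b), 2))  # 0.95
```

Raw agreement does not correct for agreement expected by chance; a chance-corrected measure such as Cohen's kappa is a common alternative.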

  • NLP Tools

    Word segmentation. Methods: n-gram + dictionary + regular expression

    97.1% based on VietTreebank, with 220,000 annotated Vietnamese words

    98.2% based on 100 sentences not included in VietTreebank

    POS tagger. Methods: MEMs, CRFs

    Training: 20,000 sentences with POS from VietTreebank and the VN dictionary

    90%

    Syntactic parser 1. Method: HPSG grammar

    P = 82%, R = 74%, F-score = 78%, tested on 100 sentences in VietTreebank

    Syntactic parser 2. Method: LPCFG, Bikel’s implementation

    F-score = 78%, tested on 9,600 sentences in VietTreebank

    Chunker. Method: CRF, online learning on > 9,000 sentences with POS as in VietTreebank

    94%
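The F-score reported for syntactic parser 1 is consistent with the harmonic mean of its precision and recall. A minimal sketch, assuming the balanced F1 definition (the slides do not state which F-measure was used; `f_score` is a name chosen for the example):

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported for syntactic parser 1.
f1 = f_score(0.82, 0.74)
print(round(f1, 2))  # 0.78
```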

  • We need Asian Language Treebank

    ALT is a key resource for most Asian languages.

    It can be constructed from multilingual corpora across all Asian languages, with:

    The same standard of infrastructure

    The same kind of tools

    It accelerates NLP research for Asian languages

    We have treebanks for English, Japanese, Vietnamese

    How about Indonesian, Thai, Khmer, Lao, Malay, Myanmar, Filipino, ...?

    [Diagram] Tools: word segmenter, POS tagger, chunker, syntactic parser, …; applications: search engines, information retrieval, machine translation, QA systems, …