
Word Segmentation and Transliteration in Chinese and Japanese

Masato Hagiwara, Rakuten Institute of Technology, New York

CUNY NLP Seminar 4/5/2013

Who am I?

HAGIWARA, Masato (萩原 正人)
Senior Scientist at Rakuten Institute of Technology, New York

Ph.D. from Nagoya University (2009)
Internships at Google and Microsoft Research (2005, 2008)
R&D Engineer at Baidu, Japan (2009-2010)


Agenda

Word Segmentation
Transliteration
Integrated Models

Word Segmentation in Chinese and Japanese

Maximum Forward Match

[Wong and Chan 1996]

Input: 日文章鱼怎么说
Lexicon: 日 (day), 日文 (Japanese), 文章 (article), 章鱼 (octopus), 鱼 (fish), 怎么 (how), 说 (say)

Greedily match the longest lexicon items from the beginning (or, for backward matching, from the end):

日文 / 章鱼 / 怎么 / 说 → Japanese / octopus / how / say
("How do you say octopus in Japanese?")
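A minimal Python sketch of this greedy procedure, using the slide's toy lexicon (the function name and the fall-back to single characters are illustrative assumptions):

LEXICON = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}
MAX_LEN = max(len(w) for w in LEXICON)

def max_forward_match(sentence: str) -> list[str]:
    """Greedily take the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(sentence), i + MAX_LEN), i, -1):
            if sentence[i:j] in LEXICON or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

print(max_forward_match("日文章鱼怎么说"))  # ['日文', '章鱼', '怎么', '说']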


Examples Where Maximum Match Fails

警察枪杀了那个逃犯
→ 警察 (police) / 枪杀 (gun-kill) / 了 (perf.) / 那 (that) / 个 (meas.) / 逃犯 (escapee)
"The police shot that escapee."

警察用枪杀了那个逃犯
→ 警察 (police) / 用 (with) / 枪杀 (gun-kill) / ? ? ?

The intended reading of the second sentence is "police / with / gun / kill / (perf.) / that / escapee", i.e., 枪 and 杀 belong to separate words, but maximum match still greedily groups 枪杀. Heuristic rules and "word binding" scores are used to resolve such cases.


Heuristic Approaches – Minimum Bunsetsu Number

[Yoshimura et al. 1983]

今日本当に良い天気ですね ("It's really nice weather today, isn't it?")

Segmentation A: 今日 (today) / 本当に (really) / 良い (good) / 天気 (weather) / です (is) / ね (part.) → # of bunsetsu = 4
Segmentation B: 今 (now) / 日本 (Japan) / 当に (really) / 良い (good) / 天気 (weather) / です (is) / ね (part.) → # of bunsetsu = 5

Lexicon: 今 n. (now), 今日 n. (today), ...
The method optimizes over the whole sentence, choosing the segmentation with the fewest bunsetsu (A).

What is Bunsetsu?

Minimum Bunsetsu Number

Bunsetsu (文節) = one independent word followed by zero or more attached words: [indep. word] [attch. word]*

今日 (indep.) / 本当に (indep.) / 良い (indep.) / 天気 (indep.) / です (attch.) / ね (attch.) → # of bunsetsu = 4

Minimize the total segmentation cost:

min Σ_w cost(w), where cost(w) = 1 if w is an independent word, 0 if w is an attached word

so the total cost equals the number of bunsetsu.

A Special Case of Minimum Cost Methods

[Yoshimura et al. 1983]


Word-based Models

[Word lattice for 東京都に住む "(I) live in Tokyo": candidate nodes BOS, 東 (east, pre.), 東京 (Tokyo, n.), 京 (capital, n.), 京都 (Kyoto, n.), 都 (pref., suf.), に (in, p.), 住む (live, v.), EOS, with node and edge costs such as 10, 20, and 40.]

Decoding finds the word sequence w_1 ... w_N minimizing

Σ_{i=1}^{N} [ cost1(w_i) + cost2(w_{i-1}, w_i) ]

Training ... HMM, Perceptron, CRF, ...

Decoding ... Viterbi algorithm, A*, ...

[Kudo et al. 2004]

Lexicon: 東 (east) pre., 東京 (Tokyo) n., 京 (capital) n., 京都 (Kyoto) n., 都 (pref.) suf., に (in) p., 住む (live) v.
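A sketch of the lattice search as dynamic programming (Viterbi), with made-up unigram costs loosely following the figure and a stubbed-out connection cost; minimum bunsetsu number is the special case where cost1 is 1 for independent words, 0 for attached ones, and cost2 is 0:

# Hypothetical unigram costs (cost1) for the lattice above.
LEXICON = {"東": 20, "東京": 10, "京": 10, "京都": 10, "都": 10, "に": 10, "住む": 10}

def cost2(prev: str, cur: str) -> int:
    """Connection cost; a real model conditions on PoS, not a constant 0."""
    return 0

def viterbi(sentence: str):
    # best[i] = (total cost, words) for the best segmentation of sentence[:i]
    best = {0: (0, [])}
    for i in range(len(sentence)):
        if i not in best:
            continue
        base, words = best[i]
        prev = words[-1] if words else "BOS"
        for j in range(i + 1, len(sentence) + 1):
            w = sentence[i:j]
            if w not in LEXICON:
                continue
            c = base + LEXICON[w] + cost2(prev, w)
            if j not in best or c < best[j][0]:
                best[j] = (c, words + [w])
    return best[len(sentence)]

print(viterbi("東京都に住む"))  # (40, ['東京', '都', 'に', '住む'])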


Character-based Models 1 – Character Tagging

警 察 用 枪 杀 了 那 个 逃 犯

[Lattice: every character 警, 察, 用, 枪, 杀, ... is paired with each position tag B (begin), I (inside), E (end), S (single-character word); decoding chooses one tag per character.]

Training ... HMM, CRF, ME ...

Decoding ... Viterbi algorithm, A*, ...

[Xue and Shen 2003] [Peng et al. 2004]

LMR tagging example: 用 [S]; 计算机 [B][I][E]
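Once a tagger (HMM, CRF, MaxEnt, ...) has assigned one B/I/E/S tag per character, recovering the segmentation is mechanical; a small Python sketch (function name assumed):

def tags_to_words(chars: str, tags: list[str]) -> list[str]:
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):   # E closes a multi-character word, S is a singleton
            words.append(buf)
            buf = ""
    if buf:                     # tolerate a sequence ending mid-word
        words.append(buf)
    return words

print(tags_to_words("警察用枪杀了", ["B", "E", "S", "B", "E", "S"]))
# ['警察', '用', '枪杀', '了']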


Character-based Models 2 – “Boundary” Tagging

宝 石 を 磨 く
Boundary tags: 0 1 1 1
Segmented words: 宝石 / を / 磨 / く

[Neubig et al. 2011]

Boundary decisions are made independently of each other:
"Gains provided by structured prediction can be largely recovered by using a richer feature set." [Liang et al. 2008]

Classifier: SVM or logistic regression
Features: character n-grams (x_l, x_r, x_{l-1}x_l, x_l x_r, ...), character types (c(x_l), c(x_r), ...), and dictionary features (l_s, r_s, i_s, ...)
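A sketch of pointwise feature extraction for a single boundary; the feature names are illustrative, not the actual templates of [Neubig et al. 2011]:

def boundary_features(s: str, i: int) -> list[str]:
    """Features for the potential boundary between s[i-1] and s[i]."""
    l, r = s[i - 1], s[i]
    feats = [f"xl={l}", f"xr={r}", f"bigram={l}{r}"]
    if i >= 2:
        feats.append(f"left_bigram={s[i-2]}{l}")
    # Character-type features c(x) (kanji/kana/digit, ...) and dictionary
    # features (is there a known word ending/starting/spanning here?) go here.
    return feats

print(boundary_features("宝石を磨く", 2))  # boundary between 石 and を

Any off-the-shelf binary classifier (SVM, logistic regression) can then predict 0/1 for each gap.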


One-at-a-Time PoS Tagging Models

宝 石 を 磨 く
Boundary tags: 0 1 1 1
Segmented words: 宝石 / を / 磨 / く
PoS tags: N P V Suf

[Neubig et al. 2011]

Classifier: SVM or logistic regression

Enables Domain Adaptation through Partial Annotation


Pointwise Approaches and Active Learning

|ア-ク-チ-ン|フ-ィ-ラ-メ-ン-ト|は|細 胞 内 小 器 官|の|1|つ|だ|

Partial annotation: "|" marks a boundary, "-" a non-boundary, and a space is "don't care" (left unannotated).


Character-based Joint Models

警 察 用 枪 杀 了 那 个 逃 犯

[Lattice: every character is paired with joint position-PoS tags, e.g. 警 [B-Noun], 警 [S-Noun], ..., 警 [B-Verb], 用 [B-Verb], ...; decoding chooses one joint tag per character.]

[Kruengkrai et al. 2009] [Nakagawa and Uchimoto 2007]

LMR tagging example: 用 [S-Prep]; 计算机 [B-Noun][I-Noun][E-Noun]
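The same tag-decoding idea extends to joint tags such as B-Noun, recovering words and PoS in one pass; a sketch (tag format assumed):

def joint_tags_to_words(chars: str, tags: list[str]) -> list[tuple[str, str]]:
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        pos_tag, pos = tag.split("-")   # e.g. "B-Noun" -> ("B", "Noun")
        buf += ch
        if pos_tag in ("E", "S"):
            words.append((buf, pos))
            buf = ""
    return words

print(joint_tags_to_words("用计算机", ["S-Prep", "B-Noun", "I-Noun", "E-Noun"]))
# [('用', 'Prep'), ('计算机', 'Noun')]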


Stack Decoding Models

[Figure: stack decoding over 東京都...: partial hypotheses such as 東 [noun] / 京 [noun] and 東京 [noun] are extended character by character with shift actions (start a new word, e.g. 京都 [noun] or 都 [post]) and reduce actions (append to the previous word), keeping the best candidates.]

- No distinction between known and unknown words

- Flexible sets of features (e.g., long-distance constraints)

[Zhang and Clark 2008] [Okanohara and Tsujii 2008]
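A toy beam-search version of the shift/append idea; the scoring function stands in for the learned (e.g., perceptron) model, so the lexicon bonus and beam size are purely illustrative:

def score(words: list[str]) -> float:
    lex = {"東京", "京都", "都"}             # hypothetical reward for known words
    return sum(1.0 for w in words if w in lex)

def beam_segment(sentence: str, beam_size: int = 4) -> list[str]:
    beam = [[]]                              # hypotheses = word lists built so far
    for ch in sentence:
        candidates = []
        for words in beam:
            candidates.append(words + [ch])                       # shift: new word
            if words:
                candidates.append(words[:-1] + [words[-1] + ch])  # append (reduce)
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    return beam[0]

print(beam_segment("東京都"))  # ['東京', '都']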


Chinese/Japanese WS Evolution

Heuristics: maximum forward match (Chinese), minimum bunsetsu number (Japanese)
Semi-Markov word-based models
Character-based one-at-a-time models
Character-based all-at-once models
Boundary tagging models
Stack decoding models

Overall trend in both languages: generative models → discriminative models


Chinese/Japanese WS Evolution

Heuristics vs. statistics → statistics
Word-based vs. character-based → Ja: word-based, Zh: character-based
Pipeline (one-at-a-time) vs. joint (all-at-once) → joint
Generative vs. discriminative → discriminative
Viterbi vs. stack decoding → pros and cons to each

Transliteration ("Semantic" Transliteration Models)

Transliteration

Phonetic translation between languages with different writing systems:

New York / 纽约 niuyue / ニューヨーク nyuuyooku

Obama / 奥巴马 aobama / オバマ obama


Phoneme-based Methods

[Knight and Graehl 1998]

English word → English phonemes → Japanese phonemes → Japanese word
golf ball → G AA L F B AO L → g o r u h u b o o r u → ゴルフボール

Each stage is a probabilistic model: P(w), P(e|w), P(j|e), P(k|j), combined as P(w) P(e|w) P(j|e) P(k|j).

The component models are trained and composed into a large WFST (decoding from Japanese back to English words).


Direct Orthographical Mapping

P(flextime→furekkusutaime)

= P(f→fu|BOW)×P(le→re|f→fu)×P(x→kkusu|le→re)× …

Joint Source Channel Model

Example TU (transliteration unit) alignments:
f / le / x / ti / me ↔ fu / re / kkusu / tai / mu (flextime)
p / i / a / get ↔ pi / a / j / e (piaget)

Transliteration probability = product of TU n-gram probabilities.

TU probability estimation (EM algorithm):
Training corpus → [Viterbi algorithm computes the current alignment] ⇄ [frequencies → probabilities] → TU probability table, with entries such as P(f→fu | ·), P(le→re | ·), P(x→kkusu | ·).

[Li et al. 2004]
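A sketch of scoring one TU alignment under the joint source-channel model; the bigram table values are made up for illustration:

import math

TU_BIGRAM = {  # P(current TU | previous TU); hypothetical values
    (("<s>", "<s>"), ("f", "fu")): 0.2,
    (("f", "fu"), ("le", "re")): 0.3,
    (("le", "re"), ("x", "kkusu")): 0.1,
}

def log_prob(alignment: list[tuple[str, str]]) -> float:
    prev, total = ("<s>", "<s>"), 0.0
    for tu in alignment:
        total += math.log(TU_BIGRAM.get((prev, tu), 1e-9))  # crude smoothing
        prev = tu
    return total

print(log_prob([("f", "fu"), ("le", "re"), ("x", "kkusu")]))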


Multiple Language Origins

Transliteration depends on the source language origin:
piaget → piaje ピアジェ (French origin → French model)
target → taagetto ターゲット (English origin → English model)
亚历山大 Yalishanda / Alexander (Indo-European origin → Chinese transliteration model)
山本 Yamamoto / Yamamoto (Japanese origin → Japanese reading model)

...and on attributes such as gender:
Marian / Malian 玛丽安 (female name → female model)
Marino / Malinuo 马里诺 (male name → male model)

[Li et al. 2007]


Latent Class Transliteration

Class transliteration [Li et al. 2007]: explicit language-origin detection.
Latent class transliteration [Hagiwara and Sekine 2011]: replaces explicit detection with a latent class distribution.

Notation: s = source, t = target, z = latent class, K = number of latent classes (determined using development sets).

[Hagiwara and Sekine 2011]
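The formula itself did not survive extraction; using the glossary above, a plausible form of the mixture (the exact parameterization in [Hagiwara and Sekine 2011] may differ) is

P(t | s) = Σ_{z=1}^{K} P(z | s) P(t | s, z)

where class transliteration corresponds to collapsing the sum onto a single, explicitly detected class z.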

Iterative Learning via EM Algorithm

Training pairs (with latent origins Lx, Ly, Lz): piaget → piaje, target → taagetto
Initial alignments: p/i/a/get → pi/a/j/e, t/ar/get → taa/ge/tto

E step: compute transliteration probabilities and re-align each pair under the current model via Viterbi search, e.g. pi/a/get → pi/a/je, tar/get → taa/getto.

M step: update the transliteration model from the expected counts, e.g. P(get→je ジェ) ∝ Σ γ · f(get→je ジェ), yielding TU probabilities such as P(pi→pi ピ), P(ar→aa アー), P(get→je ジェ), P(get→getto ゲット), ...

[Hagiwara and Sekine 2011]
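A compact hard-EM sketch in this spirit: the E step re-aligns each pair with the current model (Viterbi), the M step turns alignment frequencies into probabilities. Unigram TUs only, and the smoothing floor and iteration count are illustrative assumptions:

from collections import Counter
from functools import lru_cache

def train(pairs, iters=5, max_tu=3):
    probs = {}                                   # P(source TU -> target TU)

    def tu_prob(s, t):
        return probs.get((s, t), 1e-4)           # floor for unseen TUs

    for _ in range(iters):
        counts = Counter()
        for src, tgt in pairs:
            @lru_cache(maxsize=None)
            def best(i, j):
                """Best probability and TU alignment of src[i:] vs tgt[j:]."""
                if i == len(src) and j == len(tgt):
                    return 1.0, ()
                best_p, best_a = 0.0, ()
                for di in range(1, max_tu + 1):
                    for dj in range(1, max_tu + 1):
                        if i + di <= len(src) and j + dj <= len(tgt):
                            p, a = best(i + di, j + dj)
                            p *= tu_prob(src[i:i + di], tgt[j:j + dj])
                            if p > best_p:
                                best_p = p
                                best_a = ((src[i:i + di], tgt[j:j + dj]),) + a
                return best_p, best_a
            _, alignment = best(0, 0)            # E step: Viterbi alignment
            counts.update(alignment)
        total = sum(counts.values()) or 1
        probs = {tu: c / total for tu, c in counts.items()}   # M step
    return probs

print(train([("piaget", "piaje"), ("target", "taagetto")]))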

Latent Semantic Transliteration Model using Dirichlet Mixture

[Figure: in latent class transliteration [Hagiwara & Sekine 2011], TUs u (e.g., u1 = get/je ジェ, u2 = get/getto ゲット) are drawn from per-class multinomials P(u|z1), P(u|z2), P(u|z3) (e.g., French vs. English classes); the proposed model instead places a mixture of Dirichlet priors P_Dir(p; α1), P_Dir(p; α2), P_Dir(p; α3) over those multinomials, whose compound distribution is a Polya distribution.]

[Hagiwara and Sekine 2012]
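For reference, when a multinomial p over TUs is integrated out under a mixture of Dirichlet priors (matching the slide's P_Dir(p; α_m) components; the paper's exact parameterization may differ), the result is a mixture of Polya (Dirichlet-multinomial) distributions:

P(u_1, ..., u_n) = Σ_m λ_m · [Γ(Σ_u α_{m,u}) / Γ(Σ_u α_{m,u} + n)] · Π_u [Γ(α_{m,u} + c_u) / Γ(α_{m,u})]

where c_u is the count of TU u in the sequence and n = Σ_u c_u.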


Discriminative Transliteration Model

[Jiampojamarn et al. 2008] [Cherry and Suzuki 2009]

Example: construct → [kænstrVkt]
Features: source-context/target substring pairs such as (n, s), (s, s), (ns, s), (st, s), (ons, s), (nst, s), ...; (s, ns), (ns, ns), (st, ns), (ons, ns), (nst, ns), ...
Search: monotone search, as in a phrasal decoder
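A sketch of generating the substring-pair features shown above; the exact templates of [Jiampojamarn et al. 2008] differ in detail:

def pair_features(src: str, i: int, tgt_sub: str, n: int = 3) -> list[str]:
    """Pair source n-grams around position i with one target substring."""
    feats = []
    for width in range(1, n + 1):
        for start in range(i - width + 1, i + 1):
            if start >= 0 and start + width <= len(src):
                feats.append(f"({src[start:start + width]}, {tgt_sub})")
    return feats

print(pair_features("construct", 2, "s"))
# ['(n, s)', '(on, s)', '(ns, s)', '(con, s)', '(ons, s)', '(nst, s)']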


Transliteration Evolution

Phoneme-based models → grapheme-based models
Character-based models → substring-based models
Generative models → joint discriminative models


Transliteration Model Evolution

Character vs. substring → substring
Phoneme vs. grapheme → grapheme
Uniform vs. semantic → semantic
Generative vs. discriminative → discriminative

Integrated Models

Compound Noun and Transliteration

ブラキッシュレッド (burakissyureddo)
Wrong split: ブラキッ | シュレッド → *bracki shred
Correct split: ブラキッシュ | レッド → blackish red

贝拉克奥巴马 (beilakeaobama) → 贝拉克 | 奥巴马 → barack obama

An English language model and a transliteration model are combined to choose the correct split.


Source/Target Language Statistics

Candidate splits of German "aktionsplan", scored by the geometric mean of the parts' frequencies in a German corpus:

aktionsplan (852) → 852
aktion (960) + plan (710) → 825.6
aktions (5) + plan (710) → 59.6
akt (224) + ion (1) + plan (710) → 54.2

Bilingual resources provide further evidence: a split is supported when the parts' translations (e.g., "action plan") appear in the aligned English text.

[Koehn and Knight 2003]
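A sketch reproducing the geometric-mean scores above from the slide's frequencies:

from math import prod

FREQ = {"aktionsplan": 852, "aktion": 960, "aktions": 5,
        "akt": 224, "ion": 1, "plan": 710}

def gmean_score(parts: list[str]) -> float:
    return prod(FREQ.get(p, 0) for p in parts) ** (1 / len(parts))

for split in (["aktionsplan"], ["aktion", "plan"],
              ["aktions", "plan"], ["akt", "ion", "plan"]):
    print(split, round(gmean_score(split), 1))
# 852.0, 825.6, 59.6, 54.2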


Use of Monolingual Paraphrase and Transliteration

[Kaji, Kitsuregawa 2011]

オックスフォードディクショナリー (okkusufoododikushonarii)
[Lattice of split candidates from BOS to EOS: オッ, ック, オックスフォード, オックスフォードデ, ディ, ディク, ディクショナリー, ィクショナリー, クショナリー, ショナリー, ...]

Transliteration corpus: オックスフォード oxford, ディクショナリー dictionary, ジャンク フード junk food
Paraphrases: アンチョビパスタ / アンチョビ・パスタ / アンチョビのパスタ, all "anchovy pasta"


Language Projection via “Online” Transliteration

[Lattice over ブラキッシュレッド: katakana substring candidates (ブラ, キッ, ブラキッ, キッシュ, シュレッド, レッド, ...) are transliterated online into English candidates (bla/bra, kish, blackish/brackish, led/red/read, shred/shread, ...) using the transliteration model; English LM features such as φ1 LMP(blackish), φ1 LMP(red), and φ2 LMP(blackish, red) are then projected back onto the word segmentation lattice.]

[Hagiwara, Sekine 2013]


Agenda

Word Segmentation
Transliteration
Integrated Models


References – Chinese Word Segmentation

[Wong and Chan 1996] Pak-kwong Wong and Chorkin Chan.

Chinese Word Segmentation based on Maximum Matching and Word Binding Force, COLING 1996.

[Xue and Shen 2003] Nianwen Xue and Libin Shen.

Chinese Word Segmentation as LMR Tagging, SIGHAN 2003.

[Xue 2003] Nianwen Xue,

Chinese Word Segmentation as Character Tagging, Computational Linguistics and Chinese Language

Processing, 2003.

[Peng et al. 2004] Fuchun Peng, Fangfang Feng, and Andrew McCallum.

Chinese Segmentation and New Word Detection using Conditional Random Fields, COLING 2004.

[Kruengkrai et al. 2009] Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro

Torisawa, Hitoshi Isahara.

An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging,

ACL/IJCNLP 2009.

[Ng and Low 2004] Hwee Tou Ng and Jin Kiat Low.

Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?

EMNLP 2004.

[Zhang and Clark 2008] Yue Zhang and Stephen Clark.

Joint Word Segmentation and POS Tagging using a Single Perceptron, ACL 2008.


References – Japanese Morphological Analysis

[Yoshimura et al. 1983] Kenji Yoshimura, Toru Hitaka, and Sho Yoshida.

Morphological Analysis of Unsegmented Japanese Sentences by the Minimum Bunsetsu Number Method (文節数最小法を用いたべた書き日本語文の形態素解析), Transactions of IPSJ, 1983.

[Kudo et al. 2004] Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto

Applying Conditional Random Fields to Japanese Morphological Analysis, EMNLP 2004.

[Nakagawa and Uchimoto 2007] Tetsuji Nakagawa and Kiyotaka Uchimoto.

A Hybrid Approach to Word Segmentation and POS Tagging, ACL 2007.

[Neubig et al. 2011] Graham Neubig, Yosuke Nakata, Shinsuke Mori.

Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis, ACL 2011.

[Okanohara and Tsujii 2008] Daisuke Okanohara and Jun'ichi Tsujii.

Morphological Analysis Handling Unknown Words Based on Shift-Reduce Operations (Shift-Reduce操作に基づく未知語を考慮した形態素解析), JNLP 2008.


References – Transliteration

[Knight and Graehl 1998] Kevin Knight and Jonathan Graehl.

Machine Transliteration, Computational Linguistics, 1998.

[Li et al. 2004] Haizhou Li, Min Zhang, Jian Su.

A Joint Source-Channel Model for Machine Transliteration, ACL 2004.

[Li et al. 2007] Haizhou Li, Khe Chai Sim, Jin-Shea Kuo, and Minghui Dong.

Semantic Transliteration of Personal Names, ACL 2007.

[Hagiwara and Sekine 2011] Masato Hagiwara and Satoshi Sekine.

Latent Class Transliteration based on Source Language Origins, ACL-HLT 2011.

[Hagiwara and Sekine 2012] Masato Hagiwara and Satoshi Sekine.

Latent Semantic Transliteration using Dirichlet Mixture. NEWS 2012.

[Jiampojamarn et al. 2007] Sittichai Jiampojamarn, Grzegorz Kondrak and Tarek Sherif.

Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion,

NAACL 2007.

[Jiampojamarn et al. 2008] Sittichai Jiampojamarn, Colin Cherry, Grzegorz Kondrak.

Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion, NAACL 2008.

[Cherry and Suzuki 2009] Colin Cherry and Hisami Suzuki.

Discriminative Substring Decoding for Transliteration, EMNLP 2009.


References – Integrated Models

[Koehn and Knight 2003] Philipp Koehn and Kevin Knight.

Empirical Methods for Compound Splitting, EACL 2003.

[Kaji and Kitsuregawa 2011] Nobuhiro Kaji and Masaru Kitsuregawa.

Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese

Katakana Words, EMNLP 2011.

[Hagiwara and Sekine 2013] Masato Hagiwara and Satoshi Sekine.

Accurate Word Segmentation using Transliteration and Language Model Projection,

ACL 2013 (to appear)
