Word Segmentation and Transliteration in Chinese and Japanese Masato Hagiwara Rakuten Institute of Technology, New York CUNY NLP Seminar 4/5/2013

Source: masatohagiwara.net/files/20130405_CUNY_NLP_Seminar.pdf




Who am I?

HAGIWARA, Masato (萩原 正人)
Senior Scientist at Rakuten Institute of Technology, New York

- Ph.D. from Nagoya University (2009)
- Internships at Google and Microsoft Research (2005, 2008)
- R&D Engineer at Baidu, Japan (2009-2010)

[Figure: Baidu, Inc. (Beijing, China) and Rakuten Institute of Technology, New York]


Agenda

- Word Segmentation
- Transliteration
- Integrated Models


Word Segmentation in Chinese and Japanese


Maximum Forward Match

[Wong and Chan 1996]

Input: 日 文 章 鱼 怎 么 说

Lexicon:
- 日 (day)
- 日文 (Japanese)
- 文章 (article)
- 章鱼 (octopus)
- 鱼 (fish)
- 怎么 (how)
- 说 (say)

Greedily match the longest lexicon items from the beginning (or from the end) of the sentence.

Output: 日文 | 章鱼 | 怎么 | 说 (Japanese | octopus | how | say)

"How do you say octopus in Japanese?"
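The greedy procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [Wong and Chan 1996]: the function name `forward_max_match` and the fallback that emits an unknown character as a one-character word are assumptions, while the lexicon is the one shown on the slide.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right longest-match segmentation."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            # Assumed fallback: an unknown single character becomes its own word.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


LEXICON = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}
print(forward_max_match("日文章鱼怎么说", LEXICON))
```

Because the matcher commits to the longest word at each position and never backtracks, it cannot recover when a shorter word would have been the right choice.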


Examples Where Maximum Match Fails

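Sentence 1: 警 察 枪 杀 了 那 个 逃 犯
- 警察 (police) | 枪杀 (gun-kill) | 了 (perf.) | 那 (that) | 个 (mes.) | 逃犯 (escapee)

Sentence 2: 警 察 用 枪 杀 了 那 个 逃 犯
- Maximum match: 警察 (police) | 用 (with) | 枪杀 (gun-kill) | ? ? ?
- Intended reading: police | with | gun | kill | (perf.) | that | escapee

Possible remedies: heuristic rules, "word binding" scores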


Heuristic Approaches – Minimum Bunsetsu Number

[Yoshimura et al. 1983]

Input: 今 日 本 当 に 良 い 天 気 で す ね

Candidate 1 (# of bunsetsu = 4):
今日 (today) | 本当に (really) | 良い (good) | 天気 (weather) | です (is) | ね (part.)

Candidate 2 (# of bunsetsu = 5):
今 (now) | 日本 (Japan) | 当に (really) | 良い (good) | 天気 (weather) | です (is) | ね (part.)

Lexicon: 今 n. (now), 今日 n. (today), ...

Optimizes for a whole sentence


What is Bunsetsu?


Minimum Bunsetsu Number

[Yoshimura et al. 1983]

Bunsetsu (文節) = [indep. word] [attch. word]*

今日 (today, indep.) | 本当に (really, indep.) | 良い (good, indep.) | 天気 (weather, indep.) | です (is, attch.) | ね (part., attch.)

# of bunsetsu = 4

$$\min \sum_{w} cost(w), \quad \text{where} \quad cost(w) = \begin{cases} 1 & \text{if } w \text{ is an indep. word} \\ 0 & \text{if } w \text{ is an attch. word} \end{cases}$$

A special case of minimum cost methods
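The cost function above lends itself to a small dynamic program. The sketch below is an illustration under assumptions, not the method of [Yoshimura et al. 1983] as published: the function name `min_bunsetsu_segment` and the word-type labels in `LEXICON` are hypothetical, and the constraint that an attached word must follow an independent word is not enforced.

```python
import math


def min_bunsetsu_segment(text, lexicon):
    """Segment text so that the number of independent words is minimal.

    lexicon maps each word to "indep" (cost 1) or "attch" (cost 0).
    """
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: minimal cost of segmenting text[:i]
    back = [None] * (n + 1)      # back[i]: start index of the last word used
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            kind = lexicon.get(text[start:end])
            if kind is None or best[start] == math.inf:
                continue
            cost = best[start] + (1 if kind == "indep" else 0)
            if cost < best[end]:
                best[end], back[end] = cost, start
    if best[n] == math.inf:
        raise ValueError("no segmentation covers the input with this lexicon")
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1], best[n]


# Hypothetical lexicon covering both candidate segmentations of the example.
LEXICON = {
    "今": "indep", "今日": "indep", "日本": "indep", "本当に": "indep",
    "当に": "indep", "良い": "indep", "天気": "indep",
    "です": "attch", "ね": "attch",
}
words, cost = min_bunsetsu_segment("今日本当に良い天気ですね", LEXICON)
print(words, cost)
```

Unlike greedy matching, the search scores complete segmentations, which is what "optimizes for a whole sentence" refers to.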


Word-based Models

[Kudo et al. 2004]

[Figure: word lattice from BOS to EOS with nodes 東京 n. (Tokyo), 東 pre. (East), 京 n. (Capital), 京都 n. (Kyoto), 都 suf. (Pref.), に p. (in), 住む v. (live); nodes and edges carry costs of 10, 20 or 40]

$$\min \sum_{i=1}^{N} \left[ cost_1(w_i) + cost_2(w_{i-1}, w_i) \right]$$

Training ... HMM, Perceptron, CRF, ...

Decoding ... Viterbi algorithm, A*, ...

Lexicon:
- 東 (east) pre.
- 東京 (Tokyo) n.
- 京 (capital) n.
- 京都 (Kyoto) n.
- 都 (Pref.) suf.
- に (in) p.
- 住む (live) v.
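A minimal Viterbi decoder over such a lattice might look as follows. It is a sketch under assumptions: the function name `viterbi_segment`, the tables `WORD_COST` and `CONN_COST`, and every cost value in them are hypothetical (the slide's costs cannot be mapped to specific nodes and edges), and missing connection costs default to 0.

```python
def viterbi_segment(text, word_cost, conn_cost):
    """Minimise sum_i cost1(w_i) + cost2(w_{i-1}, w_i) over all lexicon segmentations."""
    n = len(text)
    # best[i][w] = (cost of the cheapest path whose last word w ends at i, previous word)
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0, None)
    for end in range(1, n + 1):
        for start in range(end):
            w = text[start:end]
            if w not in word_cost:
                continue
            for prev, (cost, _) in best[start].items():
                total = cost + word_cost[w] + conn_cost.get((prev, w), 0)
                if w not in best[end] or total < best[end][w][0]:
                    best[end][w] = (total, prev)
    if not best[n]:
        raise ValueError("no path through the lattice")

    def final_cost(w):
        # Include the transition from the last word into EOS.
        return best[n][w][0] + conn_cost.get((w, "EOS"), 0)

    last = min(best[n], key=final_cost)
    total = final_cost(last)
    # Follow back-pointers from the end of the sentence to BOS.
    words, pos, w = [], n, last
    while w != "BOS":
        words.append(w)
        prev = best[pos][w][1]
        pos -= len(w)
        w = prev
    return words[::-1], total


# Hypothetical costs, chosen only for illustration.
WORD_COST = {"東京": 10, "東": 20, "京": 20, "京都": 10, "都": 10, "に": 10, "住む": 10}
CONN_COST = {("東", "京都"): 40}
print(viterbi_segment("東京都に住む", WORD_COST, CONN_COST))
```

With these illustrative costs the path 東京 | 都 | に | 住む is cheaper than the one through 京都; a trained model would supply the costs instead.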


Character-based Models 1 – Character Tagging

[Xue and Shen 2003] [Peng et al. 2004]

Input: 警 察 用 枪 杀 了 那 个 逃 犯

[Figure: tag lattice in which every character (警, 察, 用, 枪, 杀, ...) can take one of the tags [B], [I], [E], [S]]

LMR Tagging, for example:
- 用 [S]
- 计算机 [B] [I] [E]

Training ... HMM, CRF, ME ...

Decoding ... Viterbi algorithm, A*, ...
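Converting between segmented words and per-character tags is mechanical, as the following sketch shows. It assumes the tags are read as B = beginning, I = inside, E = end of a multi-character word and S = single-character word; the helper names `words_to_tags` and `tags_to_words` are hypothetical.

```python
def words_to_tags(words):
    """Turn a list of words into (character, tag) pairs."""
    tagged = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))
        else:
            tagged.append((w[0], "B"))
            tagged.extend((c, "I") for c in w[1:-1])
            tagged.append((w[-1], "E"))
    return tagged


def tags_to_words(tagged):
    """Rebuild words from (character, tag) pairs; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops in the middle of a word
        words.append(current)
    return words


print(words_to_tags(["用", "计算机"]))
```

A tagger such as an HMM or CRF predicts the tag column; `tags_to_words` then yields the segmentation.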


Character-based Models 2 – “Boundary” Tagging

[Neubig et al. 2011]

Characters: 宝 石 を 磨 く
Boundary tags: 0 1 1 1
Segmented words: 宝石 | を | 磨 | く

Boundary decisions ... independent from each other

"Gains provided by structured prediction can be largely recovered by using a richer feature set." [Liang et al. 2008]

Classifiers: SVM, Logistic Regression

Features:
- $x_l, x_r, x_{l-1}x_l, x_l x_r, \ldots$
- $c(x_l), c(x_r), \ldots$
- $l_s, r_s, i_s, \ldots$ (dict. feat.)
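The mapping from boundary tags to words, and the kind of local features a pointwise classifier sees, can be sketched as below. The function names and feature-string formats are hypothetical; only the character n-gram features $x_l$, $x_r$, $x_{l-1}x_l$ and $x_l x_r$ from the slide are covered, and the character-type and dictionary features are left out.

```python
def apply_boundaries(chars, tags):
    """tags[i] is 1 if a word boundary lies between chars[i] and chars[i + 1]."""
    if len(tags) != len(chars) - 1:
        raise ValueError("need exactly one tag per gap between characters")
    words, current = [], chars[0]
    for ch, tag in zip(chars[1:], tags):
        if tag == 1:
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words


def gap_features(chars, i):
    """Character n-gram features for the gap between chars[i] and chars[i + 1]."""
    x_l, x_r = chars[i], chars[i + 1]
    feats = {"xl=" + x_l, "xr=" + x_r, "xlxr=" + x_l + x_r}
    if i > 0:
        feats.add("xl-1xl=" + chars[i - 1] + x_l)
    return feats


print(apply_boundaries("宝石を磨く", [0, 1, 1, 1]))
print(sorted(gap_features("宝石を磨く", 1)))
```

Since each gap is classified on its own, a feature set like this can be fed to any binary classifier, such as the SVM or logistic regression named on the slide.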


One-at-a-Time PoS Tagging Models

[Neubig et al. 2011]

Characters: 宝 石 を 磨 く
Boundary tags: 0 1 1 1
Segmented words: 宝石 | を | 磨 | く
POS tags: N | P | V | Suf

Classifiers: SVM, Logistic Regression

Enables domain adaptation through partial annotation


Pointwise Approaches and Active Learning

Partial annotation:

|ア-ク-チ-ン|フ-ィ-ラ-メ-ン-ト|は|細 胞 内 小 器 官|の|1|つ|だ|

Legend: "-" = non-boundary, "|" = boundary, space = "don't care"
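A partially annotated line in this notation can be turned into training examples for a pointwise classifier by keeping only the labelled gaps. The sketch below assumes exactly one mark between consecutive characters and one code point per character; `parse_partial_annotation` is a hypothetical helper, not part of any released tool.

```python
def parse_partial_annotation(line):
    """Return (characters, labels); labels[i] describes the gap after characters[i].

    1 = boundary ("|"), 0 = non-boundary ("-"), None = "don't care" (space).
    """
    body = line.strip("|")  # the outer bars only delimit the sentence
    chars = list(body[0::2])
    marks = body[1::2]
    table = {"-": 0, "|": 1, " ": None}
    labels = [table[m] for m in marks]
    if len(labels) != len(chars) - 1:
        raise ValueError("expected exactly one mark between consecutive characters")
    return chars, labels


chars, labels = parse_partial_annotation("|ア-ク|は|細 胞|")
# Only annotated gaps become training examples; "don't care" gaps are skipped.
examples = [(i, y) for i, y in enumerate(labels) if y is not None]
print(chars, labels, examples)
```

Skipping the unlabelled gaps is what lets an annotator mark only the uncertain or domain-specific parts of a sentence.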


Character-based Joint Models

[Kruengkrai et al. 2009] [Nakagawa and Uchimoto 2007]

Input: 警 察 用 枪 杀 了 那 个 逃 犯

[Figure: tag lattice in which every character (警, 察, 用, 枪, 杀, ...) can take a combined position and POS tag such as [B-Noun], [I-Noun], [E-Noun], [S-Noun], [B-Verb], ...]

LMR Tagging, for example:
- 用 [S-Prep]
- 计算机 [B-Noun] [I-Noun] [E-Noun]


Stack Decoding Models

[Zhang and Clark 2008] [Okanohara and Tsujii 2008]

[Figure: sequence of stack states built by shift and reduce actions, with hypotheses such as 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], 都 [post]]

- No distinction between known and unknown words
- Flexible sets of features (e.g., long distance constraints)


Chinese/Japanese WS Evolution

[Figure: timeline of Chinese and Japanese word segmentation methods, grouped into generative and discriminative models]

- Heuristics: Maximum Forward Match
- Heuristics: Minimum Bunsetsu Number
- Semi-Markov: Word-based Models
- Character-based: One-at-a-Time Models
- Character-based: All-at-once Models
- Boundary Tagging Models
- Stack Decoding Models


Chinese/Japanese WS Evolution

- Heuristics vs. Statistics → Statistics
- Word-based vs. Character-based → Ja: word, Zh: character
- Pipeline (One-at-a-time) vs. Joint (All-at-Once) → Joint
- Generative vs. Discriminative → Discriminative
- Viterbi vs. Stack Decoding → Pros and Cons


Transliteration (“Semantic” Transliteration Models)


Transliteration

Phonetic translation between languages with different writing systems

- New York / 纽约 niuyue / ニューヨーク nyuuyooku
- Obama / 奥巴马 aobama / オバマ obama


Phoneme-based Methods

[Knight and Graehl 1998]

- English word: golf ball, $P(w)$
- English phoneme: G AA L F B AO L, $P(e \mid w)$
- Japanese phoneme: g o r u h u b o o r u, $P(j \mid e)$
- Japanese word: ゴルフボール, $P(k \mid j)$

$$P(w)\,P(e \mid w)\,P(j \mid e)\,P(k \mid j)$$

Trains a large WFST (from Japanese to English words)


Direct Orthographical Mapping

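Joint Source Channel Model [Li et al. 2004]

$$P(\text{flextime} \to \text{furekkusutaimu}) = P(f \to fu \mid \text{BOW}) \times P(le \to re \mid f \to fu) \times P(x \to kkusu \mid le \to re) \times \cdots$$

Transliteration prob. = product of TU (transliteration unit) n-gram probs.

[Figure: example TU alignments: fl / ext / im / e ↔ frek / ku / suta / imu; p / i / aget ↔ pi / a / j / e]

TU probability estimation: starting from the training corpus, the Viterbi algorithm computes the current alignment and the EM algorithm turns TU frequencies into probabilities, yielding the TU probability table:
- P(fl→flek | ・) = XXX
- P(ext→ku | ・) = YYY
- P(p→pi | ・) = ZZZ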
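The product of TU bigram probabilities can be computed directly once an alignment is fixed. The sketch below is illustrative only: `transliteration_prob` and the table `TU_BIGRAMS` are hypothetical names, the probability values are invented, and only the three factors spelled out on the slide are included (the rest of the word stays elided).

```python
def transliteration_prob(units, bigram_prob, bow="BOW"):
    """Multiply P(unit | previous unit) along an aligned sequence of TUs."""
    prob, prev = 1.0, bow
    for unit in units:
        # Unseen bigrams get probability 0 here; a real model would smooth.
        prob *= bigram_prob.get((prev, unit), 0.0)
        prev = unit
    return prob


# Invented probabilities for the first three TUs of the flextime example.
TU_BIGRAMS = {
    ("BOW", ("f", "fu")): 0.5,
    (("f", "fu"), ("le", "re")): 0.4,
    (("le", "re"), ("x", "kkusu")): 0.25,
}
units = [("f", "fu"), ("le", "re"), ("x", "kkusu")]
print(transliteration_prob(units, TU_BIGRAMS))  # 0.5 * 0.4 * 0.25
```

Each TU pairs a source substring with a target substring, so the source and target sides are generated jointly rather than one conditioned on the other.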


Multiple Language Origins

[Li et al. 2007]

Language of origin:
- piaget / piaje ピアジェ: French origin → French model
- target / taagetto ターゲット: English origin → English model
- 亚历山大 Yalishanda / Alexander: Indo-European origin → Chinese Transliteration Model
- 山本 Yamamoto / Yamamoto: Japanese origin → Japanese Reading Model

Gender:
- Marian / Malian 玛丽安: female name → female model
- Marino / Malinuo 马里诺: male name → male model


Latent Class Transliteration

- Class transliteration [Li et al. 2007]: explicit language detection
- Latent Class Transliteration [Hagiwara and Sekine 2011]: latent class distribution

Notation:
- s: source
- t: target
- z: latent class
- K: # of latent classes (determined using dev. sets)


Iterative Learning via EM Algorithm

[Hagiwara and Sekine 2011]

Training pairs: piaget → piaje, target → taagetto

Alignments shown for the latent classes Lx, Ly, Lz:
- p/i/a/get → pi/a/j/e
- t/ar/get → taa/ge/tto

Transliteration model (per class Lx, Ly, Lz): P(pi→pi ピ), P(ar→aa アー), P(get→je ジェ), P(get→getto ゲット), …

E step: transliteration probability, based on Viterbi search
- pi/a/get → pi/a/je
- tar/get → taa/getto


Iterative Learning via EM Algorithm

[Hagiwara and Sekine 2011]

Training pairs: piaget → piaje, target → taagetto

Alignments shown for the latent classes Lx, Ly, Lz:
- p/i/a/get → pi/a/j/e
- t/ar/get → taa/ge/tto

Transliteration model (per class Lx, Ly, Lz): P(pi→pi ピ), P(ar→aa アー), P(get→je ジェ), P(get→getto ゲット), …

E step: transliteration probability, based on Viterbi search
- pi/a/get → pi/a/je
- tar/get → taa/getto

M step: update the transliteration model using Σγ*f(get→je ジェ)


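Latent Semantic Transliteration Model using Dirichlet Mixture

[Hagiwara and Sekine 2012]

- Latent Class Transliteration [Hagiwara and Sekine 2011]: multinomial per latent class, $P(u \mid z_1)$, $P(u \mid z_2)$, $P(u \mid z_3)$
- Latent Semantic Transliteration using Dirichlet Mixture (proposed): Dirichlet mixture with components $P_{Dir}(p; \alpha_1)$, $P_{Dir}(p; \alpha_2)$, $P_{Dir}(p; \alpha_3)$; Polya distribution

[Figure: the two models compared over the units $u_1$ = get/je, $u_2$ = get/getto and $u_3$, with French and English classes]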


Discriminative Transliteration Model

[Jiampojamarn et al. 2008] [Cherry and Suzuki 2009]

Example: construct → [kænstrVkt]

Features:
- (n, s)
- (s, s), (ns, s), (st, s), (ons, s), (nst, s), ...
- (s, ns), (ns, ns), (st, ns), (ons, ns), (nst, ns), ...

Search: monotone search for phrasal decoder


Transliteration Evolution

[Figure: timeline of transliteration models, grouped into generative and discriminative models]

- Phoneme-based Models
- Grapheme-based Models
- Character-based Models
- Substring-based Models
- Joint Discriminative Models


Transliteration Model Evolution

- Character vs. Substring → Substring
- Phoneme vs. Grapheme → Grapheme
- Uniform vs. Semantic → Semantic
- Generative vs. Discriminative → Discriminative


Integrated Models


Compound Noun and Transliteration

ブラキッシュレッド (burakissyureddo)
- ブラキッ シュレッド → *bracki shred
- ブラキッシュ レッド → blackish red

贝拉克奥巴马 (beilakeaobama)
- 贝拉克 奥巴马 → barack obama

Resources: English Language Model, Transliteration Model


Source/Target Language Statistics

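[Koehn and Knight 2003]

Source-language statistics (German corpus): candidate splits of aktionsplan, with part frequencies and the resulting score:
- aktionsplan(852): 852
- aktion(960) plan(710): 825.6
- aktions(5) plan(710): 59.6
- akt(224) ion(1) plan(710): 54.2

[Figure: target-language statistics from bilingual resources, matching the parts of each split against translations such as "action plan" and "plan"; unmatched parts are shown as ???, with match counts between 0 and 2]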
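The four scores on the slide match the geometric mean of the part frequencies, which a few lines of Python can reproduce. The names `split_score` and `CANDIDATES` are hypothetical; the frequencies are the ones shown above.

```python
import math


def split_score(freqs):
    """Geometric mean of the corpus frequencies of the parts of a split."""
    return math.prod(freqs) ** (1.0 / len(freqs))


# Candidate splits of "aktionsplan" with the part frequencies from the slide.
CANDIDATES = {
    ("aktionsplan",): [852],
    ("aktion", "plan"): [960, 710],
    ("aktions", "plan"): [5, 710],
    ("akt", "ion", "plan"): [224, 1, 710],
}
for parts, freqs in CANDIDATES.items():
    print(" ".join(parts), round(split_score(freqs), 1))

best = max(CANDIDATES, key=lambda parts: split_score(CANDIDATES[parts]))
print(best)
```

On frequencies alone the unsplit word scores highest here, so monolingual counts by themselves do not select the split into aktion and plan.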


Use of Monolingual Paraphrase and Transliteration

[Kaji and Kitsuregawa 2011]

Input: オックスフォードディクショナリー (okkusufoododikushonarii)

[Figure: lattice from BOS to EOS over candidate substrings such as オッ, ック, オックスフォード, オックスフォードデ, ディ, ディク, ディクショナリー, ィクショナリー, クショナリー, ショナリー]

Translit. corpus:
- オックスフォード oxford
- ディクショナリー dictionary
- ジャンク フード junk food

Paraphrases:
- アンチョビパスタ anchovy pasta
- アンチョビ・パスタ anchovy pasta
- アンチョビのパスタ anchovy pasta


Language Projection via “Online” Transliteration

[Hagiwara and Sekine 2013]

[Figure: lattice from BOS to EOS with nodes such as 大人気, 大人気色, ブラ, ブラキ, ブラキッ, ブラキッシュ, キッ, キッシュ, シュレッド, レッド]

Candidates produced by the transliteration model:
- bu, ki
- bla, bra
- kish
- braki, blaki
- brackish, blackish
- led, read, red
- shread, shred

Features from the English LM:
- 21: $\phi_1^{LMP}$(red)
- 21: $\phi_1^{LMP}$(blackish)
- 22: $\phi_2^{LMP}$(blackish, red)


Agenda

- Word Segmentation
- Transliteration
- Integrated Models


References – Chinese Word Segmentation

[Wong and Chan 1996] Pak-kwong Wong and Chorkin Chan.
Chinese Word Segmentation based on Maximum Matching and Word Binding Force, COLING 1996.

[Xue and Shen 2003] Nianwen Xue and Libin Shen.
Chinese Word Segmentation as LMR Tagging, SIGHAN 2003.

[Xue 2003] Nianwen Xue.
Chinese Word Segmentation as Character Tagging, Computational Linguistics and Chinese Language Processing, 2003.

[Peng et al. 2004] Fuchun Peng, Fangfang Feng, Andrew McCallum.
Chinese Segmentation and New Word Detection using Conditional Random Fields, COLING 2004.

[Kruengkrai et al. 2009] Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, Hitoshi Isahara.
An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging, ACL/IJCNLP 2009.

[Ng and Low 2004] Hwee Tou Ng and Jin Kiat Low.
Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based? EMNLP 2004.

[Zhang and Clark 2008] Yue Zhang and Stephen Clark.
Joint Word Segmentation and POS Tagging using a Single Perceptron, ACL 2008.


References – Japanese Morphological Analysis

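[Yoshimura et al. 1983] Kenji Yoshimura, Tooru Hitaka, Sho Yoshida (吉村 賢治, 日高 達, 吉田 将).
Morphological Analysis of Unsegmented Japanese Sentences using the Minimum Bunsetsu Number Method (文節数最小法を用いたべた書き日本語文の形態素解析), Transactions of the Information Processing Society of Japan (情報処理学会論文誌), 1983.

[Kudo et al. 2004] Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto.
Applying Conditional Random Fields to Japanese Morphological Analysis, EMNLP 2004.

[Nakagawa and Uchimoto 2007] Tetsuji Nakagawa and Kiyotaka Uchimoto.
A Hybrid Approach to Word Segmentation and POS Tagging, ACL 2007.

[Neubig et al. 2011] Graham Neubig, Yosuke Nakata, Shinsuke Mori.
Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis, ACL 2011.

[Okanohara and Tsujii 2008] Daisuke Okanohara, Jun'ichi Tsujii (岡野原 大輔, 辻井 潤一).
Morphological Analysis Considering Unknown Words based on Shift-Reduce Operations (Shift-Reduce操作に基づく未知語を考慮した形態素解析), JNLP 2008.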


References – Transliteration

[Knight and Graehl 1998] Kevin Knight and Jonathan Graehl.
Machine Transliteration, Computational Linguistics, 1998.

[Li et al. 2004] Haizhou Li, Min Zhang, Jian Su.
A Joint Source-Channel Model for Machine Transliteration, ACL 2004.

[Li et al. 2007] Haizhou Li, Khe Chai Sim, Jin-Shea Kuo, Minghui Dong.
Semantic Transliteration of Personal Names, ACL 2007.

[Hagiwara and Sekine 2011] Masato Hagiwara and Satoshi Sekine.
Latent Class Transliteration based on Source Language Origins, ACL-HLT 2011.

[Hagiwara and Sekine 2012] Masato Hagiwara and Satoshi Sekine.
Latent Semantic Transliteration using Dirichlet Mixture, NEWS 2012.

[Jiampojamarn et al. 2007] Sittichai Jiampojamarn, Grzegorz Kondrak and Tarek Sherif.
Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, NAACL 2007.

[Jiampojamarn et al. 2008] Sittichai Jiampojamarn, Colin Cherry, Grzegorz Kondrak.
Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion, ACL 2008.

[Cherry and Suzuki 2009] Colin Cherry and Hisami Suzuki.
Discriminative Substring Decoding for Transliteration, EMNLP 2009.


References – Integrated Models

[Koehn and Knight 2003] Philipp Koehn and Kevin Knight.
Empirical Methods for Compound Splitting, EACL 2003.

[Kaji and Kitsuregawa 2011] Nobuhiro Kaji and Masaru Kitsuregawa.
Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words, EMNLP 2011.

[Hagiwara and Sekine 2013] Masato Hagiwara and Satoshi Sekine.
Accurate Word Segmentation using Transliteration and Language Model Projection, ACL 2013 (to appear).