Introduction to Japanese Input Method

Yoh Okuno

Who are you?

•  Name: Yoh Okuno

•  Software Engineer at Yahoo! Japan

•  Interest: NLP, Machine Learning, Data Mining

•  Skill: C/C++, Python, Hadoop, etc.

•  Website: http://www.yoh.okuno.name/

Activities

•  Winner of Microsoft Speller Challenge

•  TokyoNLP: Founded NLP community in Japan

•  Social IME: Developed cloud-based IME

•  Academic papers about:

– Phrase Extraction for Predictive Input Method

– Large-Scale Language Models via Hadoop

What is Japanese Input Method?

•  Japanese language has too many characters!

– More than 6,000 kanji and 50 kana characters

•  We cannot input them directly from a keyboard :(

Using Kana Kanji Conversion

•  We can input kana and convert to kanji.

•  Conversion is ambiguous!

•  Accuracy is the key issue in kana kanji conversion

Ex: input good morning

Statistical Approach

•  Statistical approach resolves ambiguity well

•  Use corpora and show frequent words

[Diagram: Corpora are used to train a Model (batch). The Converter looks up the Model to turn the User's input (kana) into output (kanji).]

Noisy Channel Model

•  We want to know most probable output

•  Bayes rule divides it into two components

•  P(y): Language model

•  P(x|y): Pronunciation model (easier task)

$$\hat{y} = \operatorname*{argmax}_{y} P(y \mid x)$$

$$P(y \mid x) \propto P(y) \, P(x \mid y)$$

x: input kana, y: output kanji
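The decomposition above can be illustrated with a minimal Python sketch. The probability tables below are hypothetical toy values invented for this example, not estimates from any corpus, and `convert` is an illustrative name rather than part of any real IME.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# Language model P(y): how likely each kanji string is on its own.
LANGUAGE_MODEL = {"交渉": 0.6, "高尚": 0.3, "考証": 0.1}

# Pronunciation model P(x|y): how likely the reading x is, given output y.
PRONUNCIATION_MODEL = {
    ("こうしょう", "交渉"): 1.0,
    ("こうしょう", "高尚"): 1.0,
    ("こうしょう", "考証"): 1.0,
}

def convert(x, candidates):
    """Return the candidate y that maximizes P(y) * P(x|y)."""
    def log_score(y):
        p = LANGUAGE_MODEL.get(y, 0.0) * PRONUNCIATION_MODEL.get((x, y), 0.0)
        # Work in log space; impossible candidates get -infinity.
        return math.log(p) if p > 0 else float("-inf")
    return max(candidates, key=log_score)

best = convert("こうしょう", ["交渉", "高尚", "考証"])
print(best)
```

All three candidates share the same reading, so in this toy setup the language model alone decides the winner.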

Language Model

•  Sentence is sequence of words

•  Assume 1st order Markov chain

•  Maximum likelihood estimation

$$P(y) = \prod_{i} P(y_i \mid y_{i-1})$$

$$P(y_i \mid y_{i-1}) = \frac{C(y_i, y_{i-1})}{C(y_{i-1})}$$

C(y): count of y in corpus
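A minimal sketch of this maximum likelihood estimate in Python follows. The two-sentence corpus is a hypothetical toy example, and the `<s>` and `</s>` boundary markers are an assumption of this sketch, not something the slides specify.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed for this sketch)

def train_bigram(sentences):
    """Collect the counts C(y_{i-1}) and C(y_i, y_{i-1}) from tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        seq = [BOS] + list(words) + [EOS]
        for prev, word in zip(seq, seq[1:]):
            unigrams[prev] += 1
            bigrams[(prev, word)] += 1
    return unigrams, bigrams

def bigram_prob(word, prev, unigrams, bigrams):
    """P(word | prev) = C(word, prev) / C(prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words, unigrams, bigrams):
    """P(y) as the product of bigram probabilities over the sentence."""
    seq = [BOS] + list(words) + [EOS]
    prob = 1.0
    for prev, word in zip(seq, seq[1:]):
        prob *= bigram_prob(word, prev, unigrams, bigrams)
    return prob

# Hypothetical toy corpus, already split into words.
corpus = [["私", "は", "学生", "です"], ["私", "は", "研究", "する"]]
unigrams, bigrams = train_bigram(corpus)
print(bigram_prob("学生", "は", unigrams, bigrams))
print(sentence_prob(["私", "は", "学生", "です"], unigrams, bigrams))
```

Unsmoothed estimates like these assign zero probability to any unseen bigram, so a practical system would likely add smoothing; that step is outside what the slide describes.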

Viterbi Algorithm

•  Viterbi algorithm searches best path in lattice

Linear time complexity (Dynamic programming)
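The lattice search can be sketched in Python as below. This is a simplified illustration under stated assumptions: the dictionary and bigram probabilities are hypothetical toy values, the pronunciation model is ignored, and the function names are invented for this example. Bounding the word length by the longest dictionary reading is what keeps the work per input position constant.

```python
import math

BOS = "<s>"  # sentence-start marker (an assumption of this sketch)

def viterbi(kana, dictionary, bigram_logprob):
    """Return the highest-scoring word sequence covering `kana`, or None.

    dictionary: maps a kana reading to a list of candidate words.
    bigram_logprob(prev, word): log P(word | prev).
    """
    n = len(kana)
    max_len = max((len(reading) for reading in dictionary), default=0)
    # best[i] maps the last word ending at position i to (score, backpointer).
    best = [dict() for _ in range(n + 1)]
    best[0][BOS] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for word in dictionary.get(kana[start:end], []):
                for prev, (prev_score, _) in best[start].items():
                    score = prev_score + bigram_logprob(prev, word)
                    if word not in best[end] or score > best[end][word][0]:
                        best[end][word] = (score, (start, prev))
    if not best[n]:
        return None  # no dictionary path covers the whole input
    # Follow backpointers from the best final word to the start.
    word = max(best[n], key=lambda w: best[n][w][0])
    pos, path = n, []
    while pos > 0:
        path.append(word)
        pos, word = best[pos][word][1]
    return path[::-1]

# Hypothetical toy dictionary and bigram probabilities.
DICTIONARY = {
    "けん": ["県", "件"],
    "きゅう": ["九", "急"],
    "けんきゅう": ["研究"],
    "する": ["する", "刷る"],
}
BIGRAM_PROB = {
    (BOS, "研究"): 0.5,
    ("研究", "する"): 0.8,
    ("研究", "刷る"): 0.01,
    (BOS, "県"): 0.1,
    ("県", "九"): 0.01,
}

def bigram_logprob(prev, word):
    # Unseen bigrams get a small fixed probability (a crude stand-in for smoothing).
    return math.log(BIGRAM_PROB.get((prev, word), 0.001))

result = viterbi("けんきゅうする", DICTIONARY, bigram_logprob)
print(result)
```

With these toy numbers the path through 研究 and する scores highest, because both of its bigrams are far more probable than any alternative segmentation.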

Trie: lookup dictionary

•  Tree with node=character

•  Efficient substring search

•  Query: KENKYUSURU

•  Result: KE, KEN, KENK, KENKYU

[Figure: trie diagram with kana character nodes: け, ん, き, ゅ, う, こ, う, た, っ, き, ー, し]
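The prefix lookup on this slide can be sketched with a small dictionary-based trie in Python. The word list is a hypothetical example chosen to resemble the kana in the figure, and the class and method names are assumptions of this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps one character to the child node
        self.is_word = False  # True if a dictionary entry ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def common_prefix_search(self, query):
        """Return every dictionary entry that is a prefix of `query`."""
        node = self.root
        results = []
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break  # no entry continues with this character
            if node.is_word:
                results.append(query[: i + 1])
        return results

# Hypothetical dictionary entries (readings only).
trie = Trie(["け", "けん", "けんきゅう", "けんこう", "けんたっきー", "けんし"])
prefixes = trie.common_prefix_search("けんきゅうする")
print(prefixes)
```

A single walk down the trie finds all matching entries in time proportional to the query length, which is why this structure suits building the conversion lattice one start position at a time.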

Conclusion

•  Japanese input needs special software

•  Kana kanji conversion is a fully statistical task

•  Search and lookup are interesting algorithms

•  Any questions?
