Introduction to Japanese Input Method Yoh Okuno


Page 1: Introduction to Japanese Input Method

Introduction to Japanese Input Method

Yoh Okuno

Page 2: Introduction to Japanese Input Method

Who are you?

• Name: Yoh Okuno

• Software Engineer at Yahoo! Japan

• Interests: NLP, Machine Learning, Data Mining

• Skills: C/C++, Python, Hadoop, etc.

• Website: http://www.yoh.okuno.name/

Page 3: Introduction to Japanese Input Method

Activities

• Winner of the Microsoft Speller Challenge

• TokyoNLP: Founded an NLP community in Japan

• Social IME: Developed a cloud-based IME

• Academic papers on:

– Phrase Extraction for Predictive Input Method

– Large-Scale Language Models via Hadoop

Page 4: Introduction to Japanese Input Method

What is Japanese Input Method?

• Japanese language has too many characters!

– More than 6,000 kanji and 50 kana characters

• We cannot input them directly with a keyboard :(

Page 5: Introduction to Japanese Input Method

Using Kana Kanji Conversion

• We can input kana and convert it to kanji.

• Conversion is ambiguous!

• Accuracy is the key issue in kana kanji conversion

Ex: input good morning

Page 6: Introduction to Japanese Input Method

Statistical Approach

• Statistical approach resolves ambiguity well

• Use corpora and show frequent words

[Diagram: the Model is trained from Corpora (Batch); the Converter does a Lookup in the Model to turn the User's Input (Kana) into Output (Kanji).]
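The train-then-lookup flow can be sketched in a few lines of Python. This is only a toy illustration: the corpus of (kana, kanji) pairs and the names `train`, `convert`, `TOY_CORPUS` and `MODEL` are assumptions made for the example, not part of the talk.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Batch step: count how often each kanji form occurs for each kana reading."""
    model = defaultdict(Counter)
    for kana, kanji in corpus:
        model[kana][kanji] += 1
    return model

def convert(model, kana):
    """Lookup step: return the candidates for `kana`, most frequent first."""
    counts = model.get(kana, Counter())
    return [kanji for kanji, _ in counts.most_common()]

# Toy corpus of (reading, surface) pairs, invented for illustration.
TOY_CORPUS = [
    ("はし", "橋"), ("はし", "橋"), ("はし", "橋"),
    ("はし", "箸"), ("はし", "箸"),
    ("はし", "端"),
]
MODEL = train(TOY_CORPUS)
```

Calling `convert(MODEL, "はし")` ranks the three candidates by their counts in the toy corpus. A real converter also has to handle whole sentences, which is where the models on the following slides come in.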

Page 7: Introduction to Japanese Input Method

Noisy Channel Model

x: input kana, y: output kanji

• We want to know the most probable output

$$\hat{y} = \arg\max_y P(y \mid x)$$

• Bayes' rule divides it into two components

$$P(y \mid x) \propto P(y)\, P(x \mid y)$$

• P(y): Language model

• P(x|y): Pronunciation model (easier task)
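A minimal sketch of the decision rule, assuming both models are already available as lookup tables. All probabilities below are invented for illustration, and `LANGUAGE_MODEL`, `PRONUNCIATION_MODEL`, `log_score` and `decode` are hypothetical names.

```python
import math

# Hypothetical probabilities, invented for illustration only.
LANGUAGE_MODEL = {"端": 0.5, "橋": 0.3, "箸": 0.2}   # P(y)
PRONUNCIATION_MODEL = {                               # P(x | y)
    ("はし", "端"): 0.4,
    ("はし", "橋"): 0.9,
    ("はし", "箸"): 0.9,
}

def log_score(x, y):
    """log P(y) + log P(x | y); -inf if either probability is zero or unknown."""
    p = LANGUAGE_MODEL.get(y, 0.0) * PRONUNCIATION_MODEL.get((x, y), 0.0)
    return math.log(p) if p > 0.0 else float("-inf")

def decode(x, candidates):
    """Return the candidate y that maximizes P(y) * P(x | y)."""
    return max(candidates, key=lambda y: log_score(x, y))
```

With these toy numbers the language model alone would prefer 端, but the product P(y)P(x|y) prefers 橋, so `decode("はし", ["端", "橋", "箸"])` picks 橋. Working in log space avoids underflow once many factors are multiplied.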

Page 8: Introduction to Japanese Input Method

Language Model

• Sentence is a sequence of words

• Assume 1st order Markov chain

$$P(y) = \prod_i P(y_i \mid y_{i-1})$$

• Maximum likelihood estimation

$$P(y_i \mid y_{i-1}) = \frac{C(y_i, y_{i-1})}{C(y_{i-1})}$$

C(y): count of y in corpus
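The maximum likelihood estimate is just a ratio of counts, as the sketch below shows. The sentence boundary markers `BOS` and `EOS`, the toy corpus, and the function names are assumptions added for the example; no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed boundary markers, not part of the slides

def train_bigram(sentences):
    """Count single words C(y) and adjacent word pairs over segmented sentences."""
    unigram, bigram = Counter(), Counter()
    for words in sentences:
        tokens = [BOS] + list(words) + [EOS]
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
    return unigram, bigram

def bigram_prob(unigram, bigram, prev, word):
    """MLE of P(word | prev): pair count divided by the count of prev."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]

def sentence_prob(unigram, bigram, words):
    """P(y) as the product of bigram probabilities along the sentence."""
    tokens = [BOS] + list(words) + [EOS]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(unigram, bigram, prev, word)
    return prob

# Toy pre-segmented corpus, invented for illustration.
TOY_SENTENCES = [
    ["私", "は", "学生", "です"],
    ["私", "は", "研究", "する"],
    ["彼", "は", "学生", "です"],
]
UNIGRAM, BIGRAM = train_bigram(TOY_SENTENCES)
```

In this corpus は occurs three times and is followed by 学生 twice, so `bigram_prob(UNIGRAM, BIGRAM, "は", "学生")` is 2/3.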

Page 9: Introduction to Japanese Input Method

Viterbi Algorithm

• Viterbi algorithm searches for the best path in the lattice

• Linear time complexity in the input length (dynamic programming)
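A minimal sketch of the lattice search under simplifying assumptions: the dictionary and the bigram log-probabilities are tiny hypothetical tables, every dictionary entry is treated as having P(x|y) = 1, and unseen bigrams receive a fixed penalty `UNSEEN`. None of these names or numbers come from the talk.

```python
import math

BOS = "<s>"  # assumed start-of-sentence marker

# Hypothetical dictionary: kana reading -> candidate surface forms.
DICTIONARY = {
    "きょう": ["今日", "京"],
    "は": ["は", "葉"],
    "はれ": ["晴れ"],
}

# Hypothetical bigram log-probabilities, keyed by (previous word, word).
BIGRAM_LOGPROB = {
    (BOS, "今日"): math.log(0.6),
    (BOS, "京"): math.log(0.1),
    ("今日", "は"): math.log(0.7),
    ("京", "は"): math.log(0.3),
    ("は", "晴れ"): math.log(0.5),
    ("葉", "晴れ"): math.log(0.01),
}
UNSEEN = math.log(1e-6)  # fixed penalty for bigrams missing from the table

def bigram_logprob(prev, word):
    return BIGRAM_LOGPROB.get((prev, word), UNSEEN)

def viterbi(kana, dictionary, logprob):
    """Return the best word sequence covering `kana`, or None if none exists."""
    n = len(kana)
    max_len = max(map(len, dictionary), default=0)
    # best[end][word] = (score, start, previous word) of the best path that
    # covers kana[:end] and ends with `word`.
    best = [dict() for _ in range(n + 1)]
    best[0][BOS] = (0.0, None, None)
    for start in range(n):
        if not best[start]:
            continue  # no path reaches this position
        for end in range(start + 1, min(n, start + max_len) + 1):
            for word in dictionary.get(kana[start:end], []):
                for prev, (prev_score, _, _) in best[start].items():
                    score = prev_score + logprob(prev, word)
                    if word not in best[end] or score > best[end][word][0]:
                        best[end][word] = (score, start, prev)
    if not best[n]:
        return None
    # Follow the back-pointers from the best final word to the start.
    word = max(best[n], key=lambda w: best[n][w][0])
    pos, path = n, []
    while word != BOS:
        path.append(word)
        _, pos, word = best[pos][word]
    return path[::-1]
```

Because each position only looks at readings up to `max_len` characters long, the work per position is bounded by the dictionary, which keeps the whole search linear in the length of the input.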

Page 10: Introduction to Japanese Input Method

Trie: lookup dictionary

• Tree in which each node is a character

• Efficient search for dictionary entries that are prefixes of the query

• Query: KENKYUSURU

• Result: KE, KEN, KENK, KENKYU
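The lookup on this slide can be sketched as a common prefix search over a trie. The class and method names are hypothetical, and the dictionary entries are assumed so that the output matches the slide's example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> child node
        self.is_entry = False  # True if a dictionary entry ends at this node

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True

    def common_prefix_search(self, query):
        """Return every dictionary entry that is a prefix of `query`."""
        results = []
        node = self.root
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break  # no entry continues with this character
            if node.is_entry:
                results.append(query[: i + 1])
        return results

# Assumed dictionary entries, chosen to reproduce the slide's result.
TRIE = Trie(["KE", "KEN", "KENK", "KENKYU", "KAI", "SURU"])
```

A single walk down the tree finds all matching entries, so the cost depends on the length of the query rather than on the size of the dictionary. Running the same search from each position of the input yields the edges of the conversion lattice.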

Page 11: Introduction to Japanese Input Method

Conclusion

• Japanese input needs special software

• Kana kanji conversion is a fully statistical task

• Search (Viterbi) and lookup (trie) are interesting algorithms

• Any questions?