Transliteration

CS 626 course seminar by Purva Joshi 08305907, Mugdha Bapat 07305916, Aditya Joshi 08305908, Manasi Bapat 08305906


Humans transliterate frequently for different reasons

Can a machine do this?

(Why would a machine have to do this?)

If yes, how?

Picture courtesy: Snapshot of Yahoo! Messenger


Motivation

• An important component of machine translation

• When you cannot translate, transliterate

• Generally used for named entities, technical terms and out of vocabulary words (OOV)

• Issues specific to sounds, scripts and accents

• Can a machine do this? If yes, how?

What is transliteration?

• Task of converting a word from one alphabetic script to another

Used for:

• Named entities (example: Gandhiji)

• Out of vocabulary words (example: Bank)

Linguistic issues

• Accents (example: Thoda or thora?)

• Mapping of sounds (examples: Mahaan, Kahaan)

• Back-transliteration

Linguistic Issues : Mapping of sounds

Arabic / Chinese / Hindi / Japanese

• Arabic b -> English p or b. English word: Paul transliterates to Arabic word: Baul (issue in back-transliteration)

• Origin of the proper noun determines the symbol in Chinese language

• Ideographic symbols in Chinese

• Several English symbols do not map to any Japanese symbols, so they are often mapped to the closest sounding symbol: ice cream -> aisukuriimu

• Symbols map to different symbols based on their position: America

• Difference in origin: Restaurant, constant

Overview

Diagram: Source String -> Transliteration Units; Target String -> Transliteration Units

Contents

Diagram: Source String -> Transliteration Units; Target String -> Transliteration Units

• Phoneme-based

Phoneme-based approach

Word in source language -> Pronunciation in source language -> Pronunciation in target language -> Word in target language

• P(ps | ws): word in source language to pronunciation in source language

• P(pt | ps): pronunciation in source language to pronunciation in target language

• P(wt | pt): pronunciation in target language to word in target language

• P(wt): probability of the word in target language

Note: Phoneme is the smallest linguistically distinctive unit of sound.

wt* = argmax over wt of P(wt) . P(wt | pt) . P(pt | ps) . P(ps | ws)
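The argmax above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probability tables, the placeholder names (`ps_1`, `pt_1`, `wt_a`, ...) and every number are hypothetical and do not come from the seminar. The function follows the product as written on the slide and keeps the single best path.

```python
# Hypothetical toy tables; names and numbers are placeholders, not real data.
P_WT = {"wt_a": 0.5, "wt_b": 0.5}
P_PS_GIVEN_WS = {("bapat", "ps_1"): 0.6, ("bapat", "ps_2"): 0.4}
P_PT_GIVEN_PS = {
    ("ps_1", "pt_1"): 0.9, ("ps_1", "pt_2"): 0.1,
    ("ps_2", "pt_1"): 0.2, ("ps_2", "pt_2"): 0.8,
}
P_WT_GIVEN_PT = {
    ("pt_1", "wt_a"): 0.8, ("pt_1", "wt_b"): 0.2,
    ("pt_2", "wt_a"): 0.3, ("pt_2", "wt_b"): 0.7,
}


def best_target(ws):
    """Return (wt, score) maximising P(wt) * P(wt|pt) * P(pt|ps) * P(ps|ws)."""
    best, best_score = None, 0.0
    for (word, ps), p_ps in P_PS_GIVEN_WS.items():
        if word != ws:
            continue
        for (ps_key, pt), p_pt in P_PT_GIVEN_PS.items():
            if ps_key != ps:
                continue
            for (pt_key, wt), p_wt_pt in P_WT_GIVEN_PT.items():
                if pt_key != pt:
                    continue
                score = P_WT[wt] * p_wt_pt * p_pt * p_ps
                if score > best_score:
                    best, best_score = wt, score
    return best, best_score


if __name__ == "__main__":
    print(best_target("bapat"))
```

A source word that is missing from the first table yields `(None, 0.0)`, which mirrors the "unknown pronunciations" weakness of a phoneme-based pipeline.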

Phoneme-based approach

Transliterating ‘BAPAT’

Step I : Consider each character of the word

B A P A T

Step II : Converting to phoneme sequence (source word to phonemes)

B, /ə/ or /a:/, P, /ə/ or /a:/, T

Step III : Converting to target phoneme sequence (source phonemes to target phonemes)

B, /ə/ or /a:/, P, /ə/ or /a:/, t
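Steps I and II can be illustrated by enumerating every candidate phoneme sequence for a word. The sketch below assumes a tiny, hypothetical letter-to-phoneme table in which only the letter A is ambiguous (/ə/ or /a:/), as in the BAPAT example; the lower-case symbols used for B, P and T are placeholders chosen for this illustration.

```python
from itertools import product

# Hypothetical letter-to-phoneme options, modelled on the BAPAT example.
LETTER_TO_PHONEMES = {
    "B": ["b"],
    "A": ["ə", "a:"],
    "P": ["p"],
    "T": ["t"],
}


def phoneme_sequences(word):
    """List every phoneme sequence obtained by picking one option per letter."""
    options = [LETTER_TO_PHONEMES[ch] for ch in word.upper()]
    return [list(seq) for seq in product(*options)]


if __name__ == "__main__":
    for seq in phoneme_sequences("BAPAT"):
        print(" ".join(seq))
```

With two options for each of the two A's, the word has four candidate sequences; a probabilistic model is then needed to rank them.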

Phoneme-based approach

Step IV : Phoneme sequence to target string

Each phoneme (B, /ə/, /a:/, P, /ə/, /a:/, T, t) is mapped to a symbol of the target script.

Output :

Concerns

Word in source language -> Pronunciation in source language -> Pronunciation in target language -> Word in target language

• Check if the word is valid in target language

• Check if environment is noise-free

Issues in phonetic model

• Unknown pronunciations

• Back-transliteration can be a problem: Johnson, Jonson

Example: sanhita, samhita

Contents

• Phoneme-based

• Spelling-based

Spelling-based model

• Maps source word sequences to target word sequences (i.e. direct word to word)

• The transliteration score:

• P(w): letter trigram model included. Thus, we can accommodate the words not included in the dictionary

Diagram: Word in source language -> Word in target language (the pronunciation stages in source and target language are not used)
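To see why a letter trigram model can accommodate words that are not in the dictionary, consider the minimal sketch below. It is an assumption-laden illustration, not the model from the cited papers: the class name `LetterTrigramModel`, the padding symbols `^` and `$`, the add-one smoothing and the three training words are all choices made for this example.

```python
import math
from collections import Counter


class LetterTrigramModel:
    """Letter trigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words):
        self.trigrams = Counter()
        self.contexts = Counter()
        self.alphabet = set()
        for word in words:
            padded = "^^" + word + "$"  # two start symbols, one end symbol
            self.alphabet.update(padded)
            for i in range(len(padded) - 2):
                self.contexts[padded[i:i + 2]] += 1
                self.trigrams[padded[i:i + 3]] += 1

    def log_prob(self, word):
        """Smoothed log-probability of the letter sequence of `word`."""
        padded = "^^" + word + "$"
        size = len(self.alphabet)
        total = 0.0
        for i in range(len(padded) - 2):
            context, trigram = padded[i:i + 2], padded[i:i + 3]
            total += math.log(
                (self.trigrams[trigram] + 1) / (self.contexts[context] + size)
            )
        return total


if __name__ == "__main__":
    model = LetterTrigramModel(["bank", "band", "bapat"])
    print(model.log_prob("banat"), model.log_prob("kbna"))
```

Because the score is built from letter contexts rather than whole words, an unseen but plausible string such as "banat" still receives a finite score, and it scores higher than an implausible string such as "kbna".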


Comparison of the two methods

Contents

• Phoneme-based

• Spelling-based

• Joint Source Channel

The Third Method - Why?

• Particularly developed for Chinese

• Chinese : Highly ideographic

• Example : (image courtesy: wikimedia-commons)

• Two main steps: Modeling, Decoding

Modeling step

• A bilingual dictionary in the source and target language

• From this dictionary, the character mapping between the source and target language is learnt

Example entries: John, Georgia, Geology

The string “Geo” has two possible mappings (as in Georgia and Geology), so the “context” in which it occurs is important.

Modeling step …

• N-gram Mapping :

• < Geo, > < rge, >

• < Geo, > < lo, >

• This concludes the modeling step
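The n-gram mapping learnt in the modeling step can be sketched as bigram statistics over aligned source/target pairs. In the example below the aligned entries are hypothetical: `T1` ... `T7` are placeholders standing in for target-script units, and the segmentations of George, Georgia and Geology are assumed for illustration rather than taken from a real dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical pre-aligned dictionary entries (source unit, target placeholder).
ALIGNED = [
    [("Geo", "T1"), ("rge", "T2")],               # George
    [("Geo", "T1"), ("rgi", "T3"), ("a", "T4")],  # Georgia
    [("Geo", "T5"), ("lo", "T6"), ("gy", "T7")],  # Geology
]

START = ("<s>", "<s>")  # marks the beginning of a word


def bigram_probs(aligned):
    """Estimate P(current pair | previous pair) by relative frequency."""
    next_counts = defaultdict(Counter)
    for entry in aligned:
        prev = START
        for unit in entry:
            next_counts[prev][unit] += 1
            prev = unit
    return {
        ctx: {unit: n / sum(nxt.values()) for unit, n in nxt.items()}
        for ctx, nxt in next_counts.items()
    }


if __name__ == "__main__":
    probs = bigram_probs(ALIGNED)
    print(probs[START])
    print(probs[("Geo", "T1")])
```

The same source string "Geo" appears with two different target placeholders, and the following unit ("rge" versus "lo") is what tells the two mappings apart.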


Decoding Step

• Consider the transliteration of the word “George”.

• Alignments of George:

• Geo | rge

• G | eo | rge
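The candidate alignments of "George" can be generated by enumerating every segmentation of the word into known source units. The sketch below assumes a small, hypothetical unit inventory; a real inventory would come from the aligned bilingual dictionary.

```python
def segmentations(word, units):
    """Return every way to split `word` into pieces drawn from `units`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in units:
            # Recurse on the remainder and prepend the matched unit.
            for rest in segmentations(word[end:], units):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    # Hypothetical inventory of source-side transliteration units.
    print(segmentations("George", {"Geo", "G", "eo", "rge"}))
    # → [['G', 'eo', 'rge'], ['Geo', 'rge']]
```

The decoder then has to choose between these candidates, which is where the n-gram statistics from the modeling step are used.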


Decoding step …

Decision to be made between….

• The context mapping is present in the map-dictionary

• Using ……
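One way to make the decision between candidate alignments is to score each one with bigram probabilities over transliteration pairs and keep the highest-scoring candidate. The sketch below assumes a hypothetical probability table (`BIGRAM`), placeholder target units `T1` ... `T4`, and a small floor value for unseen transitions; none of these values come from the seminar.

```python
import math

START = ("<s>", "<s>")

# Hypothetical P(current pair | previous pair); T1..T4 are placeholders.
BIGRAM = {
    (START, ("Geo", "T1")): 0.6,
    (("Geo", "T1"), ("rge", "T2")): 0.5,
    (START, ("G", "T3")): 0.1,
    (("G", "T3"), ("eo", "T4")): 0.2,
    (("eo", "T4"), ("rge", "T2")): 0.3,
}
FLOOR = 1e-6  # assumed probability for transitions missing from the table


def log_score(alignment):
    """Sum of log bigram probabilities along one candidate alignment."""
    prev, total = START, 0.0
    for unit in alignment:
        total += math.log(BIGRAM.get((prev, unit), FLOOR))
        prev = unit
    return total


def decode(candidates):
    """Pick the candidate alignment with the highest score."""
    return max(candidates, key=log_score)


if __name__ == "__main__":
    candidates = [
        [("Geo", "T1"), ("rge", "T2")],
        [("G", "T3"), ("eo", "T4"), ("rge", "T2")],
    ]
    print(decode(candidates))
```

Working in log space avoids numerical underflow when words are split into many units.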

Transliteration Alignment

• Where do the n-gram statistics come from?

Ans.: Automatic analysis of the bilingual dictionary

• How to align this dictionary?

Ans.: Using the EM algorithm

EM Algorithm

• Bootstrap: bootstrap initial random alignment

• Expectation: update n-gram statistics to estimate probability distribution

• Maximization: apply the n-gram TM to obtain new alignment

• Transliteration Units: derive a list of transliteration units from final alignment
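The loop above can be sketched as follows. This is a deliberately simplified illustration, not the procedure from the cited paper: it scores single units instead of n-grams, keeps only the best alignment per entry, assumes each target unit aligns to one non-empty source chunk (so the source must be at least as long as the target), and uses hypothetical placeholder target units `T1` ... `T4`. The function names `splits` and `align` are also made up for this example.

```python
import itertools
import random
from collections import Counter


def splits(source, k):
    """Yield every split of `source` into k non-empty contiguous chunks."""
    for cuts in itertools.combinations(range(1, len(source)), k - 1):
        bounds = (0, *cuts, len(source))
        yield [source[a:b] for a, b in zip(bounds, bounds[1:])]


def align(pairs, iterations=10, seed=0):
    rng = random.Random(seed)
    # Bootstrap: pick a random segmentation for every dictionary entry.
    current = [rng.choice(list(splits(s, len(t)))) for s, t in pairs]
    for _ in range(iterations):
        # Expectation: re-estimate unit statistics from the current alignment.
        counts = Counter()
        for chunks, (_src, tgt) in zip(current, pairs):
            counts.update(zip(chunks, tgt))
        total = sum(counts.values())

        def score(chunks, tgt):
            # Smoothed product of unit scores; unseen units stay above zero.
            p = 1.0
            for unit in zip(chunks, tgt):
                p *= (counts[unit] + 0.1) / (total + 0.1)
            return p

        # Maximization: re-align every entry with the updated statistics.
        updated = [
            max(splits(s, len(t)), key=lambda c, t=t: score(c, t))
            for s, t in pairs
        ]
        if updated == current:
            break
        current = updated
    # Transliteration units: collect the pairs used in the final alignment.
    units = sorted(
        {u for chunks, (_src, tgt) in zip(current, pairs) for u in zip(chunks, tgt)}
    )
    return current, units


if __name__ == "__main__":
    pairs = [
        ("geo", ("T1", "T2")),
        ("george", ("T1", "T2", "T3")),
        ("leo", ("T4", "T2")),
    ]
    alignment, units = align(pairs)
    print(alignment)
    print(units)
```

The fixed seed only makes the random bootstrap repeatable; different seeds may converge to different alignments on such a small dictionary.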


Evaluation

• E2C error rates for n-gram tests

• E2C v/s C2E for TM tests


Conclusion

• Transliteration can make use of phonemes as an intermediate layer to move from one script to another

• Spelling-based approach connects the word sequences of the two languages

• The joint source channel method integrates optimization of alignment and transliteration

• no pre-alignment needed

• reduction in development efforts


( the end )

References

• Joint source-channel model

H. Li, M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.

• Phoneme and spelling-based models

K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.

N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.

• For all Devanagari transliterations, www.quillpad.in/hindi/

www.wikipedia.org