28
Transliteration Transliteration CS 626 CS 626 course seminar by course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916 Aditya Joshi 08305908 Manasi Bapat 08305906

Transliteration

  • Upload
    sai

  • View
    44

  • Download
    0

Embed Size (px)

DESCRIPTION

Transliteration. CS 626 course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916 Aditya Joshi 08305908 Manasi Bapat 08305906. Humans transliterate frequently for different reasons Can a machine do this? (Why would a machine have to do this?) If yes, how?. - PowerPoint PPT Presentation

Citation preview

Page 1: Transliteration

TransliterationTransliteration

CS 626 CS 626 course seminar bycourse seminar byPurva Joshi 08305907 Mugdha Bapat 07305916

Aditya Joshi 08305908 Manasi Bapat 08305906

Page 2: Transliteration

Humans transliterate frequently for different reasons

Can a machine do this?(Why would a machine have to do this?)

If yes, how?

Picture courtesy: Snapshot of Yahoo! Messenger

Page 3: Transliteration

Motivation • An important component of machine

translation• When you cannot translate, transliterate• Generally used for named entities, technical

terms and out of vocabulary words (OOV)• Issues specific to sounds, scripts and

accents• Can a machine do this? If yes, how?

Page 4: Transliteration

• Task of converting a word from one alphabetic script to anotherUsed for:

• Named entities• : Gandhiji • Out of vocabulary words• : Bank

What is transliteration?

Page 5: Transliteration

• Accents: Thoda or thora?

• Mapping of sounds• Mahaan: Kahaan:

• Back-transliteration

Linguistic issues

Page 6: Transliteration

Arabic Chinese Hindi/ Japanese

• Arabic b -> English p or bEnglish word: Paul transliterates toArabic word: Baul (issue in Back-transliteration)

• Origin of the proper noun determines the symbol in Chinese language

• Ideographic symbols in Chinese

• Several English symbols do not mapto any Japanese symbols. So, oftenmapped to closest sounding symbolice cream aisukuriimu

Linguistic Issues : Mapping of sounds

• Symbols map to different symbolsbased on their position

America

• Difference in originRestaurantconstant

Page 7: Transliteration

xOverview

Source String

TransliterationUnits

Target String

TransliterationUnits

Page 8: Transliteration

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Page 9: Transliteration

Phoneme-based approach

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

P( ps | ws)

P ( pt | ps )

P ( wt | pt )

Note: Phoneme is the smallest linguistically distinctive unit of sound.

P(wt)

Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )

Page 10: Transliteration

Phoneme-based approach

Step I :

Consider each character of the word

Transliterating ‘BAPAT’B A P A T

P /ə//ə/ /a://a://ə//ə/ /a://a:/B T

Source word

to phonemes

P /ə//ə/ /a://a://ə//ə/ /a://a:/BT

Source phonemes

to target phonemes

t

t

Step II : Converting to phoneme seq.Step III : Converting to target phoneme seq.

Page 11: Transliteration

Phoneme-based approach

Step IV : Phoneme sequence to target stringB : /ə/ :/ə/ :

/a:/ :/a:/ :

P:/ə/ :/ə/ :

/a:/ :/a:/ :

T:

t:

Output :

Page 12: Transliteration

Concerns

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

Check if the world is validIn target language

Check if environment Is noise-free

Page 13: Transliteration

• Unknown pronunciations

• Back-transliteration can be a problem Johnson Jonson

Issues in phonetic model

sanhita

samhita

Page 14: Transliteration

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Spelling-based

Page 15: Transliteration

• Maps source word sequences to target word sequences (i.e. direct word to word)

• The transliteration score:

• P(w)

Spelling-based model

Letter trigram model included

Thus, we can accommodate the words not included in the dictionary

Pronunciationin

Source language

PronunciationIn

target language

Word inSource language

Word inTarget language

Page 16: Transliteration

Comparison of the two methods

Page 17: Transliteration

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Spelling-based

Joint Source

Channel

Page 18: Transliteration

• Particularly developed for Chinese• Chinese : Highly ideographic• Example :

• Two main steps:

The Third Method - Why?

Image courtesy: wikimedia-commons

Modeling Decoding

Page 19: Transliteration

Modeling Step• A bilingual dictionary in the source and target language

• From this dictionary, the character mapping between the source and target language is learnt

The word “Geo” has two possible mappings, the “context” in which it occurs is important

JohnGeorgiaGeology

Geo Geo

Modeling step

Page 20: Transliteration

Modeling step …

• N-gram Mapping :• < Geo, > < rge, >• < Geo, > < lo, >

• This concludes the modeling step

Modeling step …

Page 21: Transliteration

Decoding Step

• Consider the transliteration of the word “George”.

• Alignments of George:• Geo rge G eo rge

• Geo rge G eo rge

Decoding step

Page 22: Transliteration

Decision to be made between….

• The context mapping is present in the map-dictionary

• Using ……

Decoding step …

Page 23: Transliteration

• Where do the n-gram statistics come from?

Ans.: Automatic analysis of the bilingual dictionary

• How to align this dictionary?Ans. : Using EM-algorithm

Transliteration Alignment

Page 24: Transliteration

EM Algorithm

Bootstrap

Expectation

Maximization

TransliterationUnits

Bootstrap initial random alignmentUpdate n-gram statistics to estimate probability distribution

Apply the n-gram TM to obtain new alignment

Derive a list of transliteration units from final alignment

Page 25: Transliteration

Evaluation

E2C Error rates for n-gram tests E2C v/s C2E for TM Tests

Page 26: Transliteration

Conclusion

• Transliteration can make use of phonemes as an intermediate layer to move from a script to another

• Spelling-based approach connects the word sequences of the two languages

• The joint source channel method integrates optimization of alignment and transliteration

• no pre-alignment needed• reduction in development efforts

Page 27: Transliteration

( the end )

Page 28: Transliteration

References• For all Devnagari transliterations, www.quillpad.in/hindi/

H. Li,M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.

www.wikipedia.org

Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.

K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.

• Joint source-channel model

• Phoneme and spelling-based models