Kevin Knight on Writing Systems Naacl09-Print-1x2

Embed Size (px)

Citation preview

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    1/73

    Writing Systems,Transliteration and

    Decipherment

    Kevin KnightUniversity of Southern California

    Information Sciences Institute

    Richard SproatOregon Health & Science University

    Center for Spoken Language Understanding

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 1

    Overview

    An overview of writing systems

    Transcription/transliteration betweenscripts

    Traditional and automatic approaches todecipherment

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    2/73

    Part IWriting Systems and Encodings

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 3

    Some terminology

    A script is a set of symbols

    A writing system is a script paired with alanguage.

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    3/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 4

    What could writing systemsrepresent?

    In principle any linguistic level

    My dog likes avocados

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 5

    What do writing systems actuallyrepresent?

    Phonological information: Segmental systems:

    Alphabets Abjads Alphasyllabaries

    Syllables (but full syllabaries are rare)

    Words in partially logographic systems Some semantic information:

    Ancient writing systems like Sumerian, Egyptian,Chinese, Mayan

    But no full wri ting system gets by withoutsome representation of sound

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    4/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 6

    Roadmap

    Look at how Chinese writing works:Chinese is the only ancient writingsystem in current use, and in many ways itrepresents how all writing systems used tooperate.

    Detour slightly into semantic-only or

    logographic writing. Survey a range of options for phonological

    encoding

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 7

    The six writings

    xingxng simple pictograms person, wood, turtle

    zhshindicators above, below

    huy meaning compound bright (SUN+MOON)

    xngshngphonetic compounds oak (TREE+xing), duck (BIRD+ji)

    zhunzh redirected characters trust (PERSON+WORD)

    jiji false borrowings (rebuses) come (from an old pictograph for wheat)

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    5/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 8

    Xngshng characters95% of Chinese Characters ever invented consist

    of a semantic and aphonetic component

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 9

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    6/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 10

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 11

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    7/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 12

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 13

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    8/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 14

    A generalization of huy:Japanese kokuji ( )

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 15

    Japanese logography

    Japanese writing has three subsystems

    Two kana syllabaries, which well look at later

    Chinese characters kanji which usually have

    two kinds of readings: Sino-Japanese (on sound) readings: often a given

    character will have several of these

    Native Japanese (kunyomi) readings

    mountain

    on: san

    kun: yama

    islandon: tookun: shima

    carp

    on: ri

    kun: koi

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    9/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 16

    A generalization of xngshng:Vietnamese Ch Nm ( )

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 17

    Semantic-phonetic constructions inother ancient scripts

    Mayan

    [DIV] Nin Gal

    Urim [LOC] ma

    Sumerian

    Egyptian

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    10/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 18

    Syllabaries

    Syllables are often considered more naturalrepresentations in contrast to phonemes. E.g: investigations of language use suggest that many speakers do

    not divide words into phonological segments unless they havereceived explicit instruction in such segmentation comparable tothat involved in teaching an alphabetic writing system [Faber, Alice.1992. Phonemic segmentation as epiphenomenon. evidence from the history of alphabetic writing. InPamela Downing, Susan Lima, and Michael Noonan, eds, The Linguistics of Literacy. John Benjamins,

    Amsterdam, pages 111--34.]

    Syllabaries have been invented many times (true); the

    alphabet was only invented once (not so clearly true) But: very few systems exist that have a separate symbol

    for every syllable of the language: Most are defective or at least partly segmental

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 19

    Linear B (ca 1600-1100 BC)

    Derived from an earlier script (Linear A), which wasused to write an unknown language (Minoan)

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    11/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 20

    Linear B

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 21

    Cherokee (1821)

    Sequoyah(George Gist)(1767 - 1843)

    u-no-hli-s-di

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    12/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 22

    Kana

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 23

    Yi

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    13/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 24

    Segmental writing

    Somewhere around 3000 BC, the Egyptiansdeveloped a mixed writing system whosephonographic component was essentiallyconsonantal hence segmental

    One hypothesis as to why they did this is thatEgyptian like distantly related Semitic had aroot and pattern type morphology. Vowel changes indicated morphosyntactic

    differences; the consonantal root remained constant

    Thus a spelling that reflected only consonants wouldhave a constant appearance across related words

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 25

    Egyptian consonantal symbols

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    14/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 26

    Proto-Sinaitic(aka Proto-Canaanite) script

    Somewhere around 2000 BC, Semitic speakersliving in Sinai, apparently influenced byEgyptian, simplified the system and devised aconsonantal alphabet

    This was a completely consonantal system:

    No matres lectionis using consonantal symbols to

    represent long vowels as in later Semitic scripts Phoenician (and other Semitic scripts) evolved

    from this script

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 27

    Proto-Sinaitic script

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    15/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 28

    Later Semitic scripts:vowel diacritics

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 29

    The evolution of Greek writing

    Greek developed fromPhoenician

    Vowel symbols developed byreinterpreting or maybemisinterpreting Phoenician

    consonant symbols The alphabet is often described

    as only having been inventedonce. But thats not really true: the

    Brahmi and Ethiopicalphasyllabaries developedapparently independently, fromSemitic

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    16/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 30

    Alphasyllabaries: Brahmi(ca 5th century BC)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 31

    Some Brahmi-derived scripts

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    17/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 32

    Basic design of Brahmi-derivedalphasyllabaries

    Every consonant has an inherent vowel This may be canceled by an explicit cancellation sign (virama in

    Devanagari, pulli in Tamil) Or replaced by an explicit vowel diacritic

    In many scripts consonant groups are written with someconsonants subordinate to or ligatured with others

    In most scripts of India vowels have separate full anddiacritic forms: Diacritic forms are written after consonants

    Full forms are written syllable or word initially In most Southeast Asian scripts (Thai, Lao, Khmer ) this

    method is replace by one where all vowels are diacritic, andsyllables with open onsets have a special empty onset sign. (Wewill see this method used again in another script.)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 33

    Devanagari vowelsInherent

    vowel

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    18/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 34

    Kannada diacritic vowels

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 35

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    19/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 36

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 38

    Another alphasyllabary:Ethiopic (Geez) (4th century AD)

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    20/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 39

    Correct sounds for instructing the people()

    The origin of Korean Writing

    The speech of our country differs from that

    of China, and the Chinese characters do

    not match it well. So the simple folk, if they

    want to communicate, often cannot do so.

    This has saddened me, and thus I have

    created twenty eight letters. I wish that

    people should learn the letters so that they

    can conveniently use them every day.

    King Sejong the Great(Chosun Dynasty, 1446)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 40

    Hangul symbols

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    21/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 41

    Design principles of Hangul

    For consonants based onthe position of articulation

    Vowels made use of thebasic elements earth(horizontal line) and

    humankind (verticalline)

    k looks like

    the tongue rootclosing thethroat

    u as in themiddle sound ofjun.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 42

    Design principles of Hangul

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    22/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 43

    Summary

    Writing systems represent language in avariety of different ways

    But all writing systems represent sound tosome degree

    While syllabaries are indeed common,

    virtually all syllabaries require someanalysis below the syllable level

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 44

    Encodings: Unicode

    Character encodings are arranged intoplanes

    A plane consist of 65,536 (1000016) code

    points There are 17 planes (0-16) with Plane 0 being

    the Basic Multilingual Plane

    Texts are encoded in logical order, whichis more abstract than the presentationorder

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    23/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 45

    Types of code points

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 46

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    24/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 47

    Example: Devanagari Code Points

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 48

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    25/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 49

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 50

    Example of Logical Ordering: Tamil /hoo/

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    26/73

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    27/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 53

    Unicode encoding schemes

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 54

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    28/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 55

    Issues with Unicode

    The design principles are nice, but they areinconsistently applied: In Brahmi-derived alphasyllabaries each consonant

    and vowel has a separate code point. Not so in Ethiopic

    In Indian alphasyllabaries, logical order is strictlyenforced

    Not so in Thai and Lao As we saw in the Tamil example, Unicode allowsfor variants for encoding the same information

    The term ideograph should never have becomeenshrined as the term for Chinese characters

    Part II

    Transcription (Transliteration)

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    29/73

    When Languages Collide

    At the border crossing (before writing):

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 57

    W UH T ZY ERN EY M ?

    AA KH M EH DH

    AE K M EH D ?N OW,

    AA KH M EH DH

    OW K EY, W IY LJH UH S T

    K AH LY UW

    AE K M EH D

    Phonemic transfer

    Twospokenforms

    When Languages Collide

    At the border crossing (after writing):

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 58

    I need totype yourname.

    Heres mypassport.

    Whats this say?Its a bunch ofsquiggly lines.

    AA KH M EH DH

    AE K M EH D ?Argh ...Fine.

    Ackmed.

    Textual transfer

    Twowrittenforms

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    30/73

    When Languages Collide

    Japanese/English example:

    KEVIN KNIGHT English writing

    K EH V IH N N AY T English sounds

    K E B I N N A I T O Japanese sounds

    Japanese writing

    V B: phoneme inventory mismatch T T O: phonotactic constraint

    alphabetic vs. syllabic writing

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 59

    When Languages Collide

    Common translation problem

    People and place names

    New technical terms, borrowings

    Challenging when source and target

    languages have:

    different phoneme inventories

    different phonotactic constraints

    different writing systems

    English, Japanese, Russian, Chinese,Arabic, Greek

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 60

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    31/73

    Streets of Tokyo / Katakana

    Forward vs BackwardTranscription

    Forward transcription

    Import foreign term / name

    Newt Gingrich may be several ways to

    transcribe into Arabic Generally flexible

    Backward transcription

    Recover original term / name

    Usually only one right answer

    Newt Gingrich (not Newt Kinkridge)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 62

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    32/73

    Japanese News

    121512325673

    75324655

    9

    221332218

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 63

    taigaa uzzu

    chyado kyanberu

    kenii perii

    iibunpaa

    Chinese/English

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 64

    Whats myname inJapanese?

    KEBIN.NAITO

    Whats myname inChinese?

    Great! I woulddo it like this

    No! Moreappealinglike this

    Thats good,but this writtencharacter hasa more pleasingconnotation

    Hi, what are you guys doing?I brought chips and soda

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    33/73

    Chinese

    Several hundred syllables in inventory Must stick to this idiosyncratic set

    Washington Hua Sheng Dun

    No other syllables are easily written

    Homophony: after we decide on syllables,

    many characters to choose from Washington Hua Sheng Dun

    Transcription vs Translation

    Kevin Knight Nai Kai Wen or Wu Kai WenKnight/Sproat Writing Sys tems, Translit eration and Decipherment 65

    Translation versus Transcription

    Sometimes things are translated insteadof transcribed Japanese: computer

    (konpyuutaa) Chinese: computer

    (dian nao) (electric brain)

    Arabic: Southern California (Janoub Kalyfornya)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 66

    transliterated translated

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    34/73

    An Interesting Case:Whats Going On Here?

    Observed English/Japanese transcription:

    Tonya Harding toonya haadingu

    Tanya Harding taanya haadingu

    Perhaps transcription is sensitive tosource-language orthography

    Or perhaps the transcriber is mentallymis-pronouncing the source-languageword

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 67

    A Model of Transcription

    KEVIN KNIGHT English writing

    K EH V IH N N AY T English sounds

    K E B I N N A I T O Japanese sounds

    Japanese writing

    Suppose we believe these are the steps.

    We can model each step with a weighted finite-state transducer (WFST), and employ ClaudeShannons noisy-channel model.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 68

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    35/73

    A Model of Transcription

    Angela Knight

    WFST B

    WFSA A

    WFST D

    AE N J EH L UH N AY T

    WFST C

    a n j i r a n a i t o

    [Knight & Graehl 98]

    MODELING

    DIRECTION

    DECODING

    DIRECTION

    A Model of Transcription

    Angela Knight

    WFST B

    WFSA A

    WFST D

    AE N J EH L UH N AY T

    WFST C

    a n j i r a n a i t o

    [Knight & Graehl 98]

    SPELLING

    TO SOUND

    TRANSDUCER

    SOUND TO

    SPELLING

    TRANSDUCER

    PHONEMIC

    TRANSFER

    TRANSDUCER

    LANGUAGE

    MODEL

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    36/73

    Spelling to Sound Transducer

    Richard talked about writing systems. Such a system captures an infinite relation of

    pairs.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 71

    CAT : : K

    : AE

    : TSCAT :

    : S

    :

    WFSTwords soundssounds words

    Learning SequenceTransformation Probabilities

    Ideal training data:

    P(n | M) = 0.5P(m u | M) = 0.5

    need much more data,of course

    Actual training data:

    Automatically align string pairs using the unsupervised

    Expectation-Maximization (EM) algorithm.

    etc

    etc

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    37/73

    L r 0.621

    r u 0.362

    AH a 0.486o 0.169

    e 0.134

    i 0.111

    u 0.076

    English-Japanesephonemictransfer patterns

    learned fromparallelsequences

    Learned byEM algorithm

    [Knight & Graehl 98]

    WFST

    WFST B

    WFST D

    WFST C

    a n j i r a n a i t o

    AE N J IH R UH N AY T

    AH N J IH L UH N AY T OH+ millions more

    + millions more

    + millions more

    DECODING

    WFSA A

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    38/73

    A Model of Transcription

    Angela Knight

    WFST B

    WFSA A

    WFST D

    AE N J EH L UH N AY T

    WFST C

    a n j i r a n a i t o

    Can this transformation

    be learned from

    non-parallel data?

    I.e., can katakana be

    deciphered without

    parallel text?

    Well return to this

    later

    Decipherment section

    Intermission

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    39/73

    Alternative: Mapping CharacterSequences Directly

    KEVIN KNIGHT English writing

    KE VI N KN IGH T English letter chunks

    Japanese writing

    Dispenses with spelling-to-sound modelsand pronunciation dictionaries

    Can be learned from parallel data usingstatistical MT-like techniques (overcharacters instead of words)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 77

    Hybrid Mapping Models

    Sound-based and character-basedmethods can be combined

    [Al-Onaizan & Knight 02]

    [Bilac & Tanaka 04, 05]

    [Oh & Choi 2005, Oh et al 06]

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 78

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    40/73

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    41/73

    Use of Transcription inMachine Translation Systems

    Another approach [Hermjakob et al 08]

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 81

    Bilingual

    Training

    Corpus

    Transliteration Model

    Bilingual corpus, eachside with transliterateditems identified & marked

    Source side only, withtransliterated items marked(throw away target side)

    Trained monolingual transliterate metagger (doesnt just tag names!)

    Test

    CorpusTagged testcorpus

    New suggestedphrasal translations(not mandatory use)

    MT system

    Other Uses ofTranscription Models

    Cross-lingual Information retrieval, eg, [Gao et al 04]

    Recognize transcriptions in comparable corpora, eg, [Sproat et al 06]

    Regional studies, eg, [Kuo et al 09]

    Automatic speech recognition

    Phonemic transfer models might adjust for non-native speakers?

    Normalization of informal Internet Romanization schemes

    Greek, Arabic, Russian http://www.translatum.gr/converter/greeklish-converter.htm

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 82

    Cypriot Greeklish with InstantMessaging Shorthand:

    ego n 3ero re pe8kia..

    skeftoume skeftoume omos

    tpt..

    Normalized for automatic indexing ortranslation:

    ...

    ...see Greeklish, Wikipedia

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    42/73

    Overview of theTransliteration/Transcription

    Literature

    We have only touched on what is a large literature.

    http://www.cs.mu.oz.au/~skarimi/

    S. Karimi, F. Scholer, A. Turpin, A Survey onMachine Transliteration Literature, (SubmittedDec 08, Review received 31 Mar 09) UnderRevision for ACM Computing Surveys.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 83

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 84

    Discriminative models

    Often used in judging potentialtranscription pairs in comparable corporasince here one is merely trying to classify

    the pair We will briefly review two pieces of work:

    Klementiev & Roth 2006

    Some results from the 2008 JHU summerworkshop

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    43/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 85

    Klementiev & Roth 2006

    Named entities (NEs) in onelanguage co-occur with theircounterparts in the other Hussein has similar temporal

    histogram in both corpora

    Different from histogram of wordRussia

    NEs are often transcribed

    Approach is an iterative

    algorithm which exploits thesetwo observations

    Given a bilingual corpus oneside of which is tagged, itdiscovers NEs in the otherlanguage

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 86

    Klementiev & Roth 2006

    A linear discriminative approach fortranscription model M

    Use the perceptron algorithm to train M The model activation provides the score used to

    select best transcriptions

    Initialize M with a (small) set of transcriptions aspositive examples and non-NEs paired withrandom words from T as negative examples

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    44/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 87

    Klementiev & Roth 2006

    Features for the linear model M are: For a pair of NE and a candidate (ES, ET) partition Es and ET into

    substrings of length 0 to n

    Each feature is a pair of substrings

    For example, (ES, ET) = (powell, pouel), n = 2 Es {_, p, o, w, e, l, l, po, ow, we, el, ll}

    ET {_, p, o, u, e, l, po, ou, ue, el}

    Feature vector is thus ((p,_), (p, a), (w, au), (el, el),(ll, el))

    Use an observation that transcription tends to preservephonetic sequence to limit the number of features

    E.g. disallow couplings whose starting positions are too far apart(e.g. (p, ue) in the above example).

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 88

    Klementiev & Roth 2006

    S TBilingual comparable corpus (S,T)

    Set of Named Entities in S

    Input

    M

    Initialization

    Initialize transcription model M

    Repeat

    Collect candidates in T with highscore (according to current M)

    For each candidate, collect time distribution

    Add best temporally aligned candidate to D

    D

    Use D to train M

    Until D stops changing

    For each NE in S

    D

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    45/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 89

    Klementiev & Roth

    Algorithm iteratively refines transcription modelwith the help of time sequence similarity scoring

    Current transcription model chooses a list of candidates

    Best temporally aligned candidate is used for nextround of training

    Example transcription candidate lists for NE f o r s y t h for two iterations[correct is ]

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 90

    Representation matters(dont simply conclude that one should build a

    model based solely on orthography)

    Some results from the 2008 JHU CLSPWorkshop on Multilingual Spoken TermDetection

    Train a perceptron-based discriminative model on aChinese-English name dictionary with 71,548 entries(90% training, 5% held-out, 5% testing)

    Compare features based on pairing: English letters with Chinese letters (pinyin) (EL-CL)

    English letters with Chinese phonemes (EL-CP)

    English phonemes with Chinese phonemes (EP-CP)Pinyin is a relatively abstract phonemic representation that isnot a particularly accurate representation of the pronunciation

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    46/73

    Part IIIDecipherment

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 92

    Thomas Young

    Jean FranoisChampollion

    Henry CreswickeRawlinson

    Georg FriederichGrotefend

    Michael Ventris

    Some decipherers

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    47/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 93

    Not everything is decipherable

    The Phaistos Disk:

    Most serious scholars think the text is too short

    A recent find from Jiroft (Iran)

    Many suspect this Is a fake

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 94

    Symbols for the major deities of Aurnairpal II

    Not everything that consists oflinearly arranged symbols is writing

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    48/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 95

    Not every communicative symbolsystem is writing

    Naxi text

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 96

    Questions that have to be asked

    Is the artifact genuine?

    Is the symbol system linguistic or non-linguistic?

    If you have bilingual text that can help answerthe question

    What is the underlying language?

    Which direction was the text read in?

    What kind of writing system are we dealingwith?

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    49/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 97

    Techniques and issues

    Bilingual texts; names

    Structural analysis

    Verification

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 98

    Parallel and comparable texts in Egyptian(Young & Champollion, 1816 onwards)

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    50/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 99

    Parallel and comparable texts inEgyptian

    p

    t w

    l

    m

    syy

    w

    p t

    l y

    k 3 3t

    r

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 100

    Parallel text --- without parallel text

    In early September 2008, many people werefocussed on Hurricane Gustav, and whatdamage it might inflict upon the US oil industry in

    the Gulf of Mexico, or on the city of NewOrleans

    If you looked in Chinese newspapers at that timeyoud find mention of (gstf)

    Proper names are often an implicit source ofparallel text

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    51/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 101

    Grotefends (1800) decipherment ofOld Persian

    Grotefend expected to find the namesDarius and Xerxes in an inscription fromPersepolis

    Grotefend guessed that vertical bars in theinscription were word separators

    By the large number of symbols between theseparators he reasoned that the system mustbe alphabet. (Actually that turned out to bewrong.)

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 102

    Grotefends decipherment of OldPersian

    From later Persian (Avestan) texts a few things wereknown: Kings were designated in a very formulaic way: X, great king,

    king of kings son of Y

    Xerxes and Dariuss names were something like xhere and

    darheu The later word for king was keio

    From history it was known that Xerxes was the son ofDarius, and Darius the son of Hystapes (who was not aking)

    Grotefend reasoned the inscriptions might be: Xerxes great King son of Darius

    Darius great King son of Hystapes

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    52/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 103

    A, great king, son of BB, great king son of C

    xhere

    darheu

    darheu

    hystapes

    keio

    keio

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 104

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    53/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 105

    Structural analysis: Linear B(Michael Ventris, early 1950s)

    Kobers triplets:

    Ventris grid

    ru ki toru ki ti joru ki ti jaLuktos

    a mi ni soa mi ni si joa mi ni si jaAmnisos

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 106

    Verification: Linear B

    The phonology of many wordscorresponded to what was suspected forGreek from the relevant period:

    wa-na-ka (*wanaks, later anax `ruler)i-qo (*iqqwos, later hippos `horse)

    No definite articles

    Confirmation from new finds by CarlBlegen: ti ri po de qe to ro we qwetrwes

    tetr-

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    54/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 107

    Verification: Babylonian

    Babylonian is a complex mixed script. The decipherment by Henry Creswicke Rawlinson and others

    seemed so arcane that many people doubted the decipherment

    In 1857 the Royal Asiatic Society received a letter fromW.H. Fox Talbot containing a sealed translation of a textfrom the reign of Tiglath Pileser I (Middle Assyrianperiod, 11141076 BC)

    Talbot proposed comparing this with Rawlinsonstranslation, which was soon to be published

    Rawlinson not only agreed with this proposal, butsuggested that two further scholars Edward Hincksand Jules Oppert be asked to provide translations.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 108

    Verification: Babylonian

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    55/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 109

    VerificationA text from Abu Simbel

    S Sra ms

    On the Rosetta Stone, (2) was found to be aligned with the Greek wordgenethlia birthday: the Coptic word for birth was mse confirming the msreading for this glyph

    Tuthmosis

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 110

    How complete must a decipherment befor it to be verified?

    From Ventris & Chadwick, Documents in Mycenaean Greek

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    56/73

    Prospects for AutomaticDecipherment

    Automatic decipherment is whycomputers were invented, in the 1940s

    Of course, military ciphers are differentfrom unknown scripts

    But similar skills and techniques mayapply

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 111

    Letter Substitution Cipher

    Plaintext: HELLO WORLD ...

    Secret encipherment key:PLAI N: ABCDEFGHI J KLMNOPQRSTUVWXYZ

    CI PHER: PLOKMI J NUHBYGVTFCRDXESZAQW Ciphertext: NMYYT ZTRYK ...

    Key is unknown to code-breaker

    What key, if applied to the ciphertext, would

    yield sensible plaintext?

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    57/73

    . . . .

    KDCY LQZKTLJQX CY MDBCYJQL: TR

    . . . . . .HYD FKXC, FQ MKX RLQQIQ HYDL

    . . . .

    MKL DXCTW RDCDLQ JQMNKXTMB

    . . . . .

    PTBMYEQL K FKH CY LQZKTL TC.

    AB 3C 8

    D 7 #E 1 .F 3 .GH 3 .I 1 .

    J 3 .K 9 ##### VL 10 ##M 6 #N 1 .OP 1 .Q 11 ######### V

    R 3 .S

    T 7 ### VUVW 1 .X 5

    Y 7 #### VZ 2 .

    a o e. a . e o o. e .KDCY LQZKTLJQX CY MDBCYJQL: TR

    . o . a . e a . ee. e . o

    HYD FKXC, FQ MKX RLQQIQ HYDL

    a . . e . e . a

    MKL DXCTW RDCDLQ JQMNKXTMB

    . o. e a . a. o e. a

    PTBMYEQL K FKH CY LQZKTL TC.

    AB 3C 8D 7 #E 1 .F 3 .GH 3 .I 1 .

    J 3 .K 9 ##### V

    L 10 ##M 6 #N 1 .OP 1 .Q 11 ######### VR 3 .S

    T 7 ### VUVW 1 .X 5

    Y 6 #### V

    Z 2 .

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    58/73

    aut o r epai r men t o cust omer i f

    KDCY LQZKTLJQX CY MDBCYJQL: TR

    you wai t we can f r eeze yourHYD FKXC, FQ MKX RLQQIQ HYDL

    car unt i l f ut ur e mechani cs

    MKL DXCTW RDCDLQ JQMNKXTMB

    di scover a way t o r epai r i t

    PTBMYEQL K FKH CY LQZKTL TC.

    AB 3C 8

    D 7 #E 1 .F 3 .GH 3 .I 1 .

    J 3 .K 9 ##### VL 10 ##M 6 #N 1 .OP 1 .Q 11 ######### V

    R 3 .S

    T 7 ### VUVW 1 .X 5

    Y 6 #### VZ 2 .

    Letter Substitution Cipher

    How little knowledge of the plaintextlanguage is necessary for decipherment?

    Simple letter-based n-gram models

    P(a | t) -- given t, chance that next letter is a

    EM-based decipherment

    [Knight et al 06]

    Integer-programming-based decipherment

    [Ravi & Knight 08]

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 116

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    59/73

    Letter Substitution Cipher

    [Ravi & Knight 08]

    3-gram letter-basedlanguage model

    of English used fordecipherment

    Unknown Script as a Cipher

    ciphertext

    (Linear B tablet)

    Greeksounds

    ciphertext

    (Mayan writing)

    ModernMayansounds

    make the text speak

    ?

    ?

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    60/73

    Unknown Script as a Cipher

    ciphertext (6980 letters)

    primera parte

    del ingenioso

    hidalgo don

    (Don Quixote)

    ModernSpanishsounds

    ?

    26 sounds:

    B, D, G, J (canyon),

    L (yarn), T (thin), a,b, d, e, f, g, i, k, l,m, n, o, p , r,rr (trilled), s,t, tS, u, x (hat)

    32 letters:

    , , , , , ,

    a, b, c, d, e, f, g,h, i, j, k, l, m, n,o, p, q, r, s, t, uv, w, x, y, z

    ?

    [Knight & Yamada, 1999]

    ?

    Unknown Script as a Cipher

    ciphertext (6980 letters)

    primera parte

    del ingenioso

    hidalgo don

    (Don Quixote)

    ModernSpanishsounds

    P(c | p) =P(c1 | p1) *

    P(c2 | p2) *P(c3 | p3) *

    Phoneme-to-letter modelP(y | L) = 0.8 ?

    P(p) =P(p1 | START) *

    P(p2 | p1) *P(p3 | p2) *

    Phonemebigram modelP(L | tS) = 0.003

    ??

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    61/73

    Ideal Key

    B b or vD d

    G g

    J

    L l l or y

    a a or

    b b or v

    d d

    e e or

    f f

    g g

    i i or

    l l

    m m

    n n

    o o or

    p p

    r r

    t t

    tS c h

    u u or

    x j

    nothing h

    T (before a, o, u) z

    T (before e or I) c or z

    T (otherwise) c

    k (before e or I) q u

    k (before s) x

    k (otherwise) c

    rr (start of word) r

    rr (otherwise) rr

    s (after k) nothing

    s (otherwise) s

    sound letter sound letter

    Unknown Script as a Cipher

    ciphertext (6980 letters)

    primera parte

    del ingenioso

    hidalgo don

    (Don Quixote)

    ModernSpanishsounds

    ?

    EM-based decipherment finds a very good key and achieves 93%phoneme accuracy

    Correct sounds: primera parte del inxenioso iDalGo don kixoteDeciphered sounds:primera parte del inGeniosobiDalGo don kixote

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    62/73

    How to Decipher Unknown Script ifSpoken Language is Also Unknown?

    One idea: build a universal model P(s) of humanphoneme sequence production

    Human might generally say: K AH N AH R IY

    Human wont generally say: R T R K L K

    Deciphering means finding a P(c | p) table such thatthere is a decoding with a good universal P(p) score

    Universal Phonology

    Linguists know lots of stuff!

    Phoneme inventory if z, then s

    Syllable inventory

    all languages have CV (consonant-vowel) syllables if VCC, then also VC

    Syllable sonority structure {stdbptk} {mnrl} {V} {mnrl} {stdbptk}

    dram, lomp, tra, ma, ? rdam, ? lopm, ? tba, ? mla

    Physiological preference constraints tomp, tont, tongk, ? tomk, ? tonk, ? tongt, ? tonp

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    63/73

    Universal Phonology

    primera parte

    del ingenioso

    hidalgo don

    humansoundingsequence

    ??

    Task 1: Label each letter with a phoneme

    syllable

    type

    sequence

    consonant/

    vowel

    sequences

    # of

    syllables

    in word

    Universal Phonology

    primera parte

    del ingenioso

    hidalgo don

    Task 2: Label each letter with a phoneme class: C or V

    Input: primera parte del ingenioso hidalgo don Output: CCVCVCV CVCCV CVC VCCVCVVCV CVCVCCV CVC

    P(1) = ?

    P(2) = ?etc.

    P(CV) = ?

    P(V) = ?P(CVC) = ?+ 7 other types

    P(V | V) = ?P(VV | V) = ?

    P(a | V) = ?P(a | C) = ?etc.

    P(CV) = 0.45 P(VC) = 0.09

    P(V) = 0.15 P(CVC) = 0.22

    P(CCV) = 0.02 P(CCVC) = 0.01

    P(a | V) = 0.27 P(a | C) = 0.00

    P(b | V) = 0.00 P(b | C) = 0.04

    P(c | V) = 0.00 P(c | C) = 0.07

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    64/73

    Another idea: brute force

    If we dont know the spoken language,simply decode against all spokenlanguages: Pre-collect P(p) for 300 languages

    Train a P(c | p) using each P(p) in turn

    See which decoding run assigns highest P(c)

    Hard to get phoneme sequences

    Can use text sequence as a substitute

    Unknown Source Language

    UN Declaration of HumanRights

    No one shall be arbitrarily deprived of his property

    Niemand se eiendom sal arbitrr afgeneem word nie

    Asnjeri nuk duhet t privohet arbitrarisht nga pasuria e tij

    Janiw khitisa utaps oraqeps inaki aparkaspati

    Arrazoirik gabe ez zaio inori bere jabegoa kenduko

    Den ebet ne vo tennet e berc'hentiezh diganta diouzh c'hoant

    H

    Ning no ser privat arbitrriament de la seva propietat

    Di a so prupiit n ni p essa privu nimu di modu tirannicu

    Nitko ne smije samovoljno biti lien svoje imovine

    Nikdo nesm bt svvoln zbaven svho majetku

    Ingen m vilkrligt berves sin ejendom

    Niemand mag willekeurig van zijn eigendom worden beroofd

    Nul ne peut tre arbitrairement priv de sa proprit

    Nimmenmei samar fan syn eigendom berve wurde

    Ningun ser privado arbitrariamente da sa propiedade

    Niemand darf willkrlich seines Eigentums beraubt werden

    Avavgui ndojepe'a va'eri oimehicha reinte imbe teva

    Ba wanda za a kwace wa dukiyarsa ba tare da cikakken dalili ba

    Senkit sem lehet tulajdontl nknyesen megfosztani

    Engan m eftir getta svipta eign sinni

    Necuno essera private arbitrarimente de su proprietate

    N fidir a mhaoin a bhaint go forlmhach de dhuine ar bith

    Al neniu estu arbitre forprenita lia proprieto

    Kelleltki ei tohi tema vara meelevaldselt ra vtta

    Eingin skal hissini vera fyri ongartku

    Me kua ni dua e kovei vua na nona iyau

    Keltn lkn mielivaltaisesti riistettk hnen omaisuuttaan

    Exists in many of worlds languages, UTF-8 encoding

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    65/73

    Unknown Source Language

    Input:cevzr en cnegr qry vat r avbf b uvqnyt b qba dhvwbgr qr yn znapun

    Top 5 languages with best P(c) after deciphering:-5.29120 spanish

    -5.43346 galician

    -5.44087 portuguese

    -5.48023 kurdish

    -5.49751 romanian

    Best-path decoding assuming plaintext is Spanish:pr i mera par t e del i ngeni oso hi dal go don qui j ote de l a mancha

    Best-path decoding assuming plaintext is English:wi zar i s asi ve bek u- gedundl pubscon bl y whual ve be ks asequs

    Simultaneous language ID and decipherment

    Transliteration as a Cipher

    Ciphertext: Japanese Katakana

    Plaintext: English

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 130

    [Ravi & Knight 09]

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    66/73

    Foreign Language as a Cipher?

    Ciphertext: Billions of words of Albanian

    Plaintext: English

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 131

    Is it possible to train statistical MT systemswith little or no parallel text?

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 132

    Whats left to decipher?

    Proto-Elamite

    Linear A

    Etruscan

    rongorongo

    Indus Valley

    Phaistos disk

    Epi-Olmec and other Mesoamericanscripts

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    67/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 133

    Linear A(Crete, ca 2000 BC to 1200 BC)

    Clearly the precursor ofLinear B

    Mostly accounting texts(like Linear B), thoughthere are other kinds ofinscriptions

    We can read the textsbut we dont knowmuch about theunderlying language.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 134

    Etruscan(Italy, 700 BC 1st Century AD)

    The alphabet is known it was derived fromGreek and was the

    precursor to Latin The language (like that

    of Linear A) is largelyunknown

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    68/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 135

    Proto-Elamite(Iran, ca. 3100 2900 BC)

    Possibly as many as5,500 distinct signs (?)

    Underlying language isunknown may beElamite (cf later linearElamite inscriptions)

    but that is not clear

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 136

    rongorongo(Easter Island 19th Century)

    About 600 zoomorphic andanthropomorphic glyphs

    Extant corpus is about 12,000 glyphslong, all carved on driftwood

    The underlying language (Rapanui) isknown

    Ethnographic accounts of therongorongo ceremonies exist

    Claims to the contrary aside, there isno evidence this was a writing systemin the normal sense. The only bit of text that has been

    deciphered is a calendar

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    69/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 137

    Indus Valley(South Asia, 26th20th century BC) System with a few

    hundred glyphs Inscriptions are very short

    longest on a singlesurface has 17 glyphs

    The standard theory,due to Asko Parpola, isthat this was a Dravidianlanguage

    Recently, Farmer, Witzeland Sproat argued thatthis was not a writing-system at all

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 138

    Phaistos disk(Crete, ca 1800 BC??) 241 tokens with 45 distinct

    glyphs Glyphs are all pictographic

    images of animals, people,various objects

    Text is on both sides of disk ina spiral working from the

    outside The Phaistos Disk is the

    worlds first known printeddocument

    There has been a recentsuggestion (by ancient artdealer Jerome Eisenberg) thatit may be a fake

    In any case, the text is tooshort to allow for a verifiabledecipherment

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    70/73

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    71/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 141

    Further Reading

    Encoding: there are many documents on theweb that discuss encoding issues,including various documents from theUnicode Consortium.

    However, one of the best starting places is:http://www.joelonsoftware.com/articles/Unicode.html

    Further Reading

    Transliteration/Transcription

    http://www.cs.mu.oz.au/~skarimi/

    S. Karimi, F. Scholer, A. Turpin, A Survey onMachine Transliteration Literature, (SubmittedDec 08, Review received 31 Mar 09) UnderRevision for ACM Computing Surveys.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 142

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    72/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 143

    Further Reading

    Discriminative models of transcription

    1. A. Klementiev and D. Roth. 2006. Weakly supervisednamed entity transliteration and discovery frommultilingual comparable corpora. InACL.

    2. D. Zelenko and C. Aone. 2006. Discriminative methodsfor transliteration. In EMNLP.

    3. S-Y. Yoon, K-Y. Kim, and R. Sproat. 2007. Multilingual

    transliteration using feature based phonetic method. InACL.

    4. D. Goldwasser and D. Roth. 2008. Active sampleselection for named entity transliteration. InACL.

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 144

    Further Reading

    Decipherment

    1. R. Parkinson. 1999. Cracking Codes: The Rosetta Stone andDecipherment. University of California Press, Berkeley.

    2. M. Pope. 1999.The Story of Decipherment: From Egyptian

    Hieroglyphs to Maya Script. Thames and Hudson, London.

    3. A. Robinson. 2002. The Man who Deciphered Linear B: TheStory of Michael Ventris. Thames and Hudson, London.

    4. A. Robinson. 2009. Lost Languages: The Enigma of theWorlds Undeciphered Scripts. Thames and Hudson, London.

  • 8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2

    73/73

    Knight/Sproat Writing Sys tems, Translit eration and Decipherment 145

    Further Reading

    Auto Decipherment

    1. A Computational Approach to Deciphering Unknown Scripts,(K. Knight and K. Yamada), Proceedings of the ACL Workshopon Unsupervised Learning in Natural Language Processing,1999.

    2. Unsupervised Analysis for Decipherment Problems", (K.Knight, A. Nair, N. Rathod, and K. Yamada), Proc. ACL-COLING (poster), 2006.

    3. Attacking Decipherment Problems Optimally with Low-OrderN-gram Models, (S. Ravi and K. Knight), Proc. EMNLP, 2008.

    4. Learning Phoneme Mappings for Transliteration withoutParallel Data, (S. Ravi and K. Knight), Proc. NAACL, 2009.

    the end