8/14/2019 Kevin Knight on Writing Systems Naacl09-Print-1x2
1/73
Writing Systems, Transliteration and Decipherment
Kevin Knight
University of Southern California
Information Sciences Institute
Richard Sproat
Oregon Health & Science University
Center for Spoken Language Understanding
Knight/Sproat Writing Systems, Transliteration and Decipherment 1
Overview
An overview of writing systems
Transcription/transliteration between scripts
Traditional and automatic approaches to decipherment
Part I: Writing Systems and Encodings
Some terminology
A script is a set of symbols
A writing system is a script paired with a language.
What could writing systems represent?
In principle any linguistic level
My dog likes avocados
What do writing systems actually represent?
Phonological information:
  Segmental systems: alphabets, abjads, alphasyllabaries
  Syllables (but full syllabaries are rare)
Words, in partially logographic systems
Some semantic information:
  Ancient writing systems like Sumerian, Egyptian, Chinese, Mayan
But no full writing system gets by without some representation of sound
Roadmap
Look at how Chinese writing works: Chinese is the only ancient writing system in current use, and in many ways it represents how all writing systems used to operate.
Detour slightly into semantic-only or logographic writing.
Survey a range of options for phonological encoding
The six writings
xiàngxíng: simple pictograms (person, wood, turtle)
zhǐshì: indicators (above, below)
huìyì: meaning compounds (bright: SUN+MOON)
xíngshēng: phonetic compounds (oak: TREE+xiàng; duck: BIRD+jiǎ)
zhuǎnzhù: redirected characters (trust: PERSON+WORD)
jiǎjiè: false borrowings, i.e. rebuses (come, from an old pictograph for wheat)
Xíngshēng characters
95% of Chinese characters ever invented consist of a semantic and a phonetic component
A generalization of huìyì: Japanese kokuji
Japanese logography
Japanese writing has three subsystems
Two kana syllabaries, which we'll look at later
Chinese characters (kanji), which usually have two kinds of readings:
  Sino-Japanese (on, "sound") readings: often a given character will have several of these
  Native Japanese (kunyomi) readings
Examples:
  mountain: on san, kun yama
  island: on too, kun shima
  carp: on ri, kun koi
A generalization of xíngshēng: Vietnamese Chữ Nôm
Semantic-phonetic constructions in other ancient scripts
Mayan
[DIV] Nin Gal
Urim [LOC] ma
Sumerian
Egyptian
Syllabaries
Syllables are often considered more natural representations in contrast to phonemes. E.g.: "investigations of language use suggest that many speakers do not divide words into phonological segments unless they have received explicit instruction in such segmentation comparable to that involved in teaching an alphabetic writing system" [Faber, Alice. 1992. Phonemic segmentation as epiphenomenon: evidence from the history of alphabetic writing. In Pamela Downing, Susan Lima, and Michael Noonan, eds., The Linguistics of Literacy. John Benjamins, Amsterdam, pages 111-34.]
Syllabaries have been invented many times (true); the alphabet was only invented once (not so clearly true)
But: very few systems exist that have a separate symbol for every syllable of the language:
  Most are defective or at least partly segmental
Linear B (ca 1600-1100 BC)
Derived from an earlier script (Linear A), which was used to write an unknown language (Minoan)
Linear B
Cherokee (1821)
Sequoyah (George Gist) (1767-1843)
u-no-hli-s-di
Kana
Yi
Segmental writing
Somewhere around 3000 BC, the Egyptians developed a mixed writing system whose phonographic component was essentially consonantal, hence segmental
One hypothesis as to why they did this is that Egyptian, like distantly related Semitic, had a root-and-pattern type of morphology.
  Vowel changes indicated morphosyntactic differences; the consonantal root remained constant
  Thus a spelling that reflected only consonants would have a constant appearance across related words
Egyptian consonantal symbols
Proto-Sinaitic (aka Proto-Canaanite) script
Somewhere around 2000 BC, Semitic speakers living in Sinai, apparently influenced by Egyptian, simplified the system and devised a consonantal alphabet
This was a completely consonantal system:
  No matres lectionis (consonantal symbols used to represent long vowels, as in later Semitic scripts)
Phoenician (and other Semitic scripts) evolved from this script
Proto-Sinaitic script
Later Semitic scripts: vowel diacritics
The evolution of Greek writing
Greek developed from Phoenician
Vowel symbols developed by reinterpreting, or maybe misinterpreting, Phoenician consonant symbols
The alphabet is often described as only having been invented once.
  But that's not really true: the Brahmi and Ethiopic alphasyllabaries developed, apparently independently, from Semitic
Alphasyllabaries: Brahmi (ca 5th century BC)
Some Brahmi-derived scripts
Basic design of Brahmi-derived alphasyllabaries
Every consonant has an inherent vowel
  This may be canceled by an explicit cancellation sign (virama in Devanagari, pulli in Tamil)
  Or replaced by an explicit vowel diacritic
In many scripts consonant groups are written with some consonants subordinate to or ligatured with others
In most scripts of India vowels have separate full and diacritic forms:
  Diacritic forms are written after consonants
  Full forms are written syllable- or word-initially
In most Southeast Asian scripts (Thai, Lao, Khmer, ...) this method is replaced by one where all vowels are diacritic, and syllables with open onsets have a special empty-onset sign. (We will see this method used again in another script.)
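The inherent-vowel design can be made concrete with Unicode code points. The following is a minimal sketch using Python's standard `unicodedata` module; the choice of the consonant KA and the vowel I is only an illustration, and the helper name `describe` is an assumption made for this example.

```python
import unicodedata

KA = "\u0915"        # DEVANAGARI LETTER KA, read "ka" (inherent vowel a)
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA, cancels the inherent vowel
SIGN_I = "\u093F"    # DEVANAGARI VOWEL SIGN I, the diacritic form of i
LETTER_I = "\u0907"  # DEVANAGARI LETTER I, the full (independent) form of i

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

syllables = {
    "ka": KA,            # bare consonant: inherent vowel
    "k": KA + VIRAMA,    # inherent vowel canceled
    "ki": KA + SIGN_I,   # inherent vowel replaced by a diacritic
    "i": LETTER_I,       # syllable-initial vowel uses the full form
}
for reading, text in syllables.items():
    print(reading, text, describe(text))
```

Note that the vowel sign I is stored after the consonant even though it is drawn to its left, which is one instance of the logical ordering discussed under Unicode.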
Devanagari vowels
Inherent vowel
Kannada diacritic vowels
Another alphasyllabary: Ethiopic (Ge'ez) (4th century AD)
Correct sounds for instructing the people (Hunminjeongeum)
The origin of Korean Writing
The speech of our country differs from that
of China, and the Chinese characters do
not match it well. So the simple folk, if they
want to communicate, often cannot do so.
This has saddened me, and thus I have
created twenty eight letters. I wish that
people should learn the letters so that they
can conveniently use them every day.
King Sejong the Great (Chosun Dynasty, 1446)
Hangul symbols
Design principles of Hangul
Consonants are based on the position of articulation
  k looks like the tongue root closing the throat
Vowels make use of the basic elements earth (horizontal line) and humankind (vertical line)
  u as in the middle sound of jun
Design principles of Hangul
Summary
Writing systems represent language in a variety of different ways
But all writing systems represent sound to some degree
While syllabaries are indeed common, virtually all syllabaries require some analysis below the syllable level
Encodings: Unicode
Character encodings are arranged into planes
  A plane consists of 65,536 (10000 hexadecimal) code points
  There are 17 planes (0-16), with Plane 0 being the Basic Multilingual Plane
Texts are encoded in logical order, which is more abstract than the presentation order
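The plane arithmetic can be checked directly. This is a small sketch in Python; the function name `plane_of` and the sample characters are assumptions chosen for illustration.

```python
import sys

PLANE_SIZE = 0x10000   # 65,536 code points per plane
NUM_PLANES = 17        # planes 0 through 16

def plane_of(ch):
    """Return the plane number (0-16) of a single character."""
    return ord(ch) // PLANE_SIZE

print(plane_of("A"))            # a Latin letter, in the Basic Multilingual Plane
print(plane_of("\u0915"))       # Devanagari KA, also in the BMP
print(plane_of("\U00010000"))   # LINEAR B SYLLABLE B008 A, in Plane 1
# The highest code point is the last one of plane 16.
print(NUM_PLANES * PLANE_SIZE - 1 == sys.maxunicode)
```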
Types of code points
Example: Devanagari Code Points
Example of Logical Ordering: Tamil /hoo/
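The slide's figure did not survive, so the following is only a sketch of what such an example might look like, assuming the syllable is encoded as TAMIL LETTER HA followed by TAMIL VOWEL SIGN OO. The consonant comes first in memory even though part of the vowel sign is drawn to its left, and Unicode's canonical decomposition gives a second, equivalent code point sequence for the same text.

```python
import unicodedata

HA = "\u0BB9"       # TAMIL LETTER HA
SIGN_OO = "\u0BCB"  # TAMIL VOWEL SIGN OO, drawn on both sides of the consonant

hoo = HA + SIGN_OO  # logical order: consonant first, then the vowel

# Canonical decomposition (NFD) splits the two-part vowel sign into its halves.
decomposed = unicodedata.normalize("NFD", hoo)
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Two different code point sequences that are canonically equivalent.
print(len(hoo), len(decomposed))
```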
Unicode encoding schemes
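The slide's table is not recoverable, but the general idea of an encoding scheme (the same code point serialized to different byte sequences) can be sketched with Python's built-in codecs. The helper `encoded_sizes` and the three sample characters are assumptions for illustration.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies under three encoding schemes."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

samples = {
    "LATIN CAPITAL LETTER A": "A",
    "DEVANAGARI LETTER KA": "\u0915",
    "LINEAR B SYLLABLE B008 A": "\U00010000",  # outside the BMP
}
for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

Characters outside the Basic Multilingual Plane need a surrogate pair in UTF-16, which is why the Linear B syllable takes four bytes there.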
Issues with Unicode
The design principles are nice, but they are inconsistently applied:
  In Brahmi-derived alphasyllabaries each consonant and vowel has a separate code point.
    Not so in Ethiopic
  In Indian alphasyllabaries, logical order is strictly enforced
    Not so in Thai and Lao
  As we saw in the Tamil example, Unicode allows for variants for encoding the same information
The term "ideograph" should never have become enshrined as the term for Chinese characters
Part II: Transcription (Transliteration)
When Languages Collide
At the border crossing (before writing):
  "W UH T Z  Y ER  N EY M?"
  "AA KH M EH DH"
  "AE K M EH D?"
  "N OW, AA KH M EH DH"
  "OW K EY, W IY L  JH UH S T  K AH L  Y UW  AE K M EH D"
Phonemic transfer: two spoken forms
When Languages Collide
At the border crossing (after writing):
  "I need to type your name."
  "Here's my passport."
  "What's this say? It's a bunch of squiggly lines."
  "AA KH M EH DH"
  "AE K M EH D? Argh ... Fine. Ackmed."
Textual transfer: two written forms
When Languages Collide
Japanese/English example:
KEVIN KNIGHT            English writing
K EH V IH N  N AY T     English sounds
K E B I N  N A I T O    Japanese sounds
ケビン ナイト             Japanese writing
V → B: phoneme inventory mismatch
T → T O: phonotactic constraint
alphabetic vs. syllabic writing
When Languages Collide
Common translation problem
People and place names
New technical terms, borrowings
Challenging when source and target languages have:
  different phoneme inventories
  different phonotactic constraints
  different writing systems
English, Japanese, Russian, Chinese, Arabic, Greek
Streets of Tokyo / Katakana
Forward vs Backward Transcription
Forward transcription
  Import foreign term / name
  Newt Gingrich: there may be several ways to transcribe into Arabic
  Generally flexible
Backward transcription
  Recover original term / name
  Usually only one right answer
  Newt Gingrich (not Newt Kinkridge)
Japanese News
taigaa uzzu
chyado kyanberu
kenii perii
iibunpaa
Chinese/English
"What's my name in Japanese?"
"KEBIN NAITO"
"What's my name in Chinese?"
"Great! I would do it like this ..."
"No! More appealing like this ..."
"That's good, but this written character has a more pleasing connotation ..."
"Hi, what are you guys doing? I brought chips and soda."
Chinese
Several hundred syllables in inventory
  Must stick to this idiosyncratic set
  Washington → Hua Sheng Dun
  No other syllables are easily written
Homophony: after we decide on syllables, many characters to choose from
  Washington → Hua Sheng Dun
Transcription vs translation
  Kevin Knight → Nai Kai Wen or Wu Kai Wen
Translation versus Transcription
Sometimes things are translated instead of transcribed
  Japanese: computer → konpyuutaa (transliterated)
  Chinese: computer → dian nao, "electric brain" (translated)
  Arabic: Southern California → Janoub Kalyfornya (Janoub translated, Kalyfornya transliterated)
An Interesting Case: What's Going On Here?
Observed English/Japanese transcription:
  Tonya Harding → toonya haadingu
  Tanya Harding → taanya haadingu
Perhaps transcription is sensitive to source-language orthography
Or perhaps the transcriber is mentally mispronouncing the source-language word
A Model of Transcription
KEVIN KNIGHT            English writing
K EH V IH N  N AY T     English sounds
K E B I N  N A I T O    Japanese sounds
ケビン ナイト             Japanese writing
Suppose we believe these are the steps.
We can model each step with a weighted finite-state transducer (WFST), and employ Claude Shannon's noisy-channel model.
A Model of Transcription [Knight & Graehl 98]
WFSA A: language model
Angela Knight
WFST B: spelling-to-sound transducer
AE N J EH L UH N AY T
WFST C: phonemic transfer transducer
a n j i r a n a i t o
WFST D: sound-to-spelling transducer
Modeling direction: from English writing down to Japanese writing
Decoding direction: from Japanese writing back up to English writing
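Written out as a noisy-channel objective, the cascade above can be sketched as follows. The symbol names are assumptions introduced for this sketch: w is an English word sequence (scored by WFSA A), e an English sound sequence (WFST B), j a Japanese sound sequence (WFST C), and k the observed Japanese writing (WFST D).

```latex
\hat{w} \;=\; \arg\max_{w} \; P(w) \sum_{e} \sum_{j} P(e \mid w)\, P(j \mid e)\, P(k \mid j)
```

In practice the sums are often replaced by a maximum, so that decoding amounts to finding the single best path through the composed machines.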
Spelling to Sound Transducer
Richard talked about writing systems.
Such a system captures an infinite relation of pairs.
[Figure: WFST fragment relating the spellings CAT and CATS to the sounds K AE T and K AE T S; the same WFST maps words to sounds and sounds to words]
Learning Sequence Transformation Probabilities
Ideal training data: aligned sound sequences, from which probabilities can be read off by counting, e.g.
  P(n | M) = 0.5
  P(m u | M) = 0.5
  (need much more data, of course)
Actual training data: unaligned string pairs
  Automatically align string pairs using the unsupervised Expectation-Maximization (EM) algorithm.
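For the ideal case of pre-aligned data, the probabilities are plain relative frequencies. The sketch below illustrates only that counting step; the two aligned examples are hypothetical and were chosen so that the counts reproduce the P(n | M) = 0.5 and P(m u | M) = 0.5 figures, and the function name `estimate` is an assumption. EM handles the unaligned case by replacing these hard counts with expected counts over all possible alignments.

```python
from collections import Counter, defaultdict

def estimate(aligned_pairs):
    """Relative-frequency estimate of P(japanese_chunk | english_sound)."""
    counts = defaultdict(Counter)
    for alignment in aligned_pairs:
        for eng, jap in alignment:
            counts[eng][jap] += 1
    return {
        eng: {jap: n / sum(c.values()) for jap, n in c.items()}
        for eng, c in counts.items()
    }

# Hypothetical hand-aligned examples (not taken from the tutorial's data).
aligned = [
    [("K", "k"), ("IH", "i"), ("M", "m u")],               # "Kim"  -> k i m u
    [("L", "r"), ("AE", "a"), ("M", "n"), ("P", "p u")],   # "lamp" -> r a n p u
]
probs = estimate(aligned)
print(probs["M"])
```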
English-Japanese phonemic transfer patterns learned from parallel sequences by the EM algorithm [Knight & Graehl 98]
English sound   Japanese sound(s)   Probability
L               r                   0.621
                r u                 0.362
AH              a                   0.486
                o                   0.169
                e                   0.134
                i                   0.111
                u                   0.076
Decoding
The observed a n j i r a n a i t o is passed back through the cascade (WFST D, WFST C, WFST B, WFSA A); each stage yields many candidates, e.g.
  AE N J IH R UH N AY T
  AH N J IH L UH N AY T OH
  + millions more
A Model of Transcription
Can this transformation (Angela Knight to a n j i r a n a i t o, through WFSA A and WFSTs B, C, D) be learned from non-parallel data?
I.e., can katakana be deciphered without parallel text?
We'll return to this later, in the Decipherment section
Intermission
Alternative: Mapping Character Sequences Directly
KEVIN KNIGHT            English writing
KE VI N  KN IGH T       English letter chunks
ケ ビ ン  ナ イ ト         Japanese writing
Dispenses with spelling-to-sound models and pronunciation dictionaries
Can be learned from parallel data using statistical MT-like techniques (over characters instead of words)
Hybrid Mapping Models
Sound-based and character-based methods can be combined
[Al-Onaizan & Knight 02]
[Bilac & Tanaka 04, 05]
[Oh & Choi 2005, Oh et al 06]
Use of Transcription in Machine Translation Systems
Another approach [Hermjakob et al 08]
Training:
  Bilingual training corpus
  → (transliteration model) Bilingual corpus, each side with transliterated items identified and marked
  → Source side only, with transliterated items marked (throw away target side)
  → Trained monolingual "transliterate-me" tagger (doesn't just tag names!)
Testing:
  Test corpus → tagged test corpus → new suggested phrasal translations (use is not mandatory) → MT system
Other Uses of Transcription Models
Cross-lingual information retrieval, e.g. [Gao et al 04]
Recognize transcriptions in comparable corpora, e.g. [Sproat et al 06]
Regional studies, e.g. [Kuo et al 09]
Automatic speech recognition
  Phonemic transfer models might adjust for non-native speakers?
Normalization of informal Internet Romanization schemes
  Greek, Arabic, Russian
  http://www.translatum.gr/converter/greeklish-converter.htm
Cypriot Greeklish with Instant Messaging Shorthand:
ego n 3ero re pe8kia..
skeftoume skeftoume omos
tpt..
Normalized for automatic indexing or translation:
...
...see Greeklish, Wikipedia
Overview of the Transliteration/Transcription Literature
We have only touched on what is a large literature.
http://www.cs.mu.oz.au/~skarimi/
S. Karimi, F. Scholer, A. Turpin, A Survey on Machine Transliteration Literature (submitted Dec 08, review received 31 Mar 09). Under revision for ACM Computing Surveys.
Discriminative models
Often used in judging potential transcription pairs in comparable corpora, since here one is merely trying to classify the pair
We will briefly review two pieces of work:
  Klementiev & Roth 2006
  Some results from the 2008 JHU summer workshop
Klementiev & Roth 2006
Named entities (NEs) in one language co-occur with their counterparts in the other
  "Hussein" has a similar temporal histogram in both corpora
  Different from the histogram of the word "Russia"
NEs are often transcribed
Approach is an iterative algorithm which exploits these two observations
  Given a bilingual corpus, one side of which is tagged, it discovers NEs in the other language
Klementiev & Roth 2006
A linear discriminative approach for transcription model M
  Use the perceptron algorithm to train M
  The model activation provides the score used to select best transcriptions
Initialize M with a (small) set of transcriptions as positive examples and non-NEs paired with random words from T as negative examples
Klementiev & Roth 2006
Features for the linear model M are:
  For a pair of an NE and a candidate (E_S, E_T), partition E_S and E_T into substrings of length 0 to n
  Each feature is a pair of substrings
For example, (E_S, E_T) = (powell, pouel), n = 2
  E_S → {_, p, o, w, e, l, l, po, ow, we, el, ll}
  E_T → {_, p, o, u, e, l, po, ou, ue, el}
  Feature vector is thus ((p, _), (p, a), (w, au), (el, el), (ll, el))
Use the observation that transcription tends to preserve phonetic sequence to limit the number of features
  E.g. disallow couplings whose starting positions are too far apart (e.g. (p, ue) in the above example).
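The substring-pair features can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `max_offset` threshold of 1, and the use of `_` for the empty substring are choices made for this example.

```python
def substrings(word, n):
    """Return (start, substring) for every substring of length 1..n."""
    return [
        (i, word[i:i + k])
        for k in range(1, n + 1)
        for i in range(len(word) - k + 1)
    ]

def pair_features(source, target, n=2, max_offset=1):
    """Couple source and target substrings whose start positions are close."""
    features = set()
    for i, s in substrings(source, n):
        features.add((s, "_"))  # source substring paired with the empty one
        for j, t in substrings(target, n):
            if abs(i - j) <= max_offset:
                features.add((s, t))
    for _, t in substrings(target, n):
        features.add(("_", t))  # empty substring paired with a target one
    return features

feats = pair_features("powell", "pouel")
print(("el", "el") in feats)   # start positions 3 and 3: allowed
print(("p", "ue") in feats)    # start positions 0 and 2: too far apart
```

Each surviving pair would become one dimension of the feature vector that the perceptron weights.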
Klementiev & Roth 2006
Input
  Bilingual comparable corpus (S, T)
  Set of named entities in S
Initialization
  Initialize transcription model M
Repeat
  D ← ∅
  For each NE in S
    Collect candidates in T with high score (according to current M)
    For each candidate, collect time distribution
    Add best temporally aligned candidate to D
  Use D to train M
Until D stops changing
Klementiev & Roth
Algorithm iteratively refines the transcription model with the help of time sequence similarity scoring
  Current transcription model chooses a list of candidates
  Best temporally aligned candidate is used for next round of training
Example transcription candidate lists for the NE "forsyth" for two iterations
Representation matters
(don't simply conclude that one should build a model based solely on orthography)
Some results from the 2008 JHU CLSP Workshop on Multilingual Spoken Term Detection
  Train a perceptron-based discriminative model on a Chinese-English name dictionary with 71,548 entries (90% training, 5% held-out, 5% testing)
  Compare features based on pairing:
    English letters with Chinese letters (pinyin) (EL-CL)
    English letters with Chinese phonemes (EL-CP)
    English phonemes with Chinese phonemes (EP-CP)
  Pinyin is a relatively abstract phonemic representation that is not a particularly accurate representation of the pronunciation
Part III: Decipherment
Some decipherers
Thomas Young
Jean-François Champollion
Henry Creswicke Rawlinson
Georg Friedrich Grotefend
Michael Ventris
Not everything is decipherable
The Phaistos Disk:
Most serious scholars think the text is too short
A recent find from Jiroft (Iran)
Many suspect this is a fake
Not everything that consists of linearly arranged symbols is writing
Symbols for the major deities of Aššurnaṣirpal II
Not every communicative symbol system is writing
Naxi text
Questions that have to be asked
Is the artifact genuine?
Is the symbol system linguistic or non-linguistic?
  If you have bilingual text, that can help answer the question
What is the underlying language?
Which direction was the text read in?
What kind of writing system are we dealing with?
Techniques and issues
Bilingual texts; names
Structural analysis
Verification
Parallel and comparable texts in Egyptian (Young & Champollion, 1816 onwards)
[Figure: hieroglyphic signs labeled with their sound values: p, t, w, l, m, s, y, k, r, 3]
Parallel text --- without parallel text
In early September 2008, many people were focused on Hurricane Gustav, and what damage it might inflict upon the US oil industry in the Gulf of Mexico, or on the city of New Orleans
If you looked in Chinese newspapers at that time you'd find mention of 古斯塔夫 (gǔsītǎfū)
Proper names are often an implicit source of parallel text
Grotefend's (1800) decipherment of Old Persian
Grotefend expected to find the names Darius and Xerxes in an inscription from Persepolis
Grotefend guessed that vertical bars in the inscription were word separators
By the large number of symbols between the separators he reasoned that the system must be alphabetic. (Actually that turned out to be wrong.)
Grotefend's decipherment of Old Persian
From later Persian (Avestan) texts a few things were known:
  Kings were designated in a very formulaic way: X, great king, king of kings, son of Y
  Xerxes' and Darius's names were something like xhere and darheu
  The later word for king was keio
From history it was known that Xerxes was the son of Darius, and Darius the son of Hystaspes (who was not a king)
Grotefend reasoned the inscriptions might be:
  Xerxes, great king, son of Darius
  Darius, great king, son of Hystaspes
A, great king, son of B: xhere, keio, darheu
B, great king, son of C: darheu, keio, hystapes
Structural analysis: Linear B (Michael Ventris, early 1950s)
Kober's triplets:
  ru-ki-to, ru-ki-ti-jo, ru-ki-ti-ja (Luktos)
  a-mi-ni-so, a-mi-ni-si-jo, a-mi-ni-si-ja (Amnisos)
Ventris' grid
Verification: Linear B
The phonology of many words corresponded to what was suspected for Greek from the relevant period:
  wa-na-ka (*wanaks, later anax "ruler")
  i-qo (*iqqwos, later hippos "horse")
  No definite articles
Confirmation from new finds by Carl Blegen:
  ti-ri-po-de
  qe-to-ro-we (qwetrwes, cf. later tetr-)
Verification: Babylonian
Babylonian is a complex mixed script.
  The decipherment by Henry Creswicke Rawlinson and others seemed so arcane that many people doubted the decipherment
In 1857 the Royal Asiatic Society received a letter from W.H. Fox Talbot containing a sealed translation of a text from the reign of Tiglath Pileser I (Middle Assyrian period, 1114-1076 BC)
Talbot proposed comparing this with Rawlinson's translation, which was soon to be published
Rawlinson not only agreed with this proposal, but suggested that two further scholars, Edward Hincks and Jules Oppert, be asked to provide translations.
Verification: Babylonian
Verification
A text from Abu Simbel
[Figure: cartouche glyphs with the readings ra, ms, s, s]
On the Rosetta Stone, (2) was found to be aligned with the Greek word genethlia "birthday": the Coptic word for "birth" was mse, confirming the ms reading for this glyph
Tuthmosis
How complete must a decipherment be for it to be verified?
From Ventris & Chadwick, Documents in Mycenaean Greek
Prospects for Automatic Decipherment
Automatic decipherment is why computers were invented, in the 1940s
Of course, military ciphers are different from unknown scripts
But similar skills and techniques may apply
Letter Substitution Cipher
Plaintext: HELLO WORLD ...
Secret encipherment key:
  PLAIN:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
  CIPHER: PLOKMIJNUHBYGVTFCRDXESZAQW
Ciphertext: NMYYT ZTRYK ...
Key is unknown to code-breaker
What key, if applied to the ciphertext, would yield sensible plaintext?
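The encipherment step itself is a one-line table lookup. Here is a minimal Python sketch that applies the key shown on the slide; the variable names are assumptions.

```python
import string

PLAIN = string.ascii_uppercase
CIPHER = "PLOKMIJNUHBYGVTFCRDXESZAQW"  # key from the slide

ENCIPHER = str.maketrans(PLAIN, CIPHER)
DECIPHER = str.maketrans(CIPHER, PLAIN)

ciphertext = "HELLO WORLD".translate(ENCIPHER)
print(ciphertext)                       # the enciphered message
print(ciphertext.translate(DECIPHER))   # applying the inverse key recovers it
```

The code-breaker's problem is the reverse: only the ciphertext is available, and the 26-letter key has to be inferred.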
Ciphertext:
KDCY LQZKTLJQX CY MDBCYJQL: TR
HYD FKXC, FQ MKX RLQQIQ HYDL
MKL DXCTW RDCDLQ JQMNKXTMB
PTBMYEQL K FKH CY LQZKTL TC.

Ciphertext letter counts (V marks a letter guessed to be a vowel; A, G, O, S, U and V do not occur):
B 3    C 8     D 7    E 1    F 3    H 3     I 1
J 3    K 9 V   L 10   M 6    N 1    P 1     Q 11 V
R 3    T 7 V   W 1    X 5    Y 7 V  Z 2

Partial decipherment (K = a, Q = e, Y = o):
a__o _e_a___e_ _o ____o_e_: __
KDCY LQZKTLJQX CY MDBCYJQL: TR
_o_ _a__, _e _a_ __ee_e _o__
HYD FKXC, FQ MKX RLQQIQ HYDL
_a_ _____ _____e _e__a____
MKL DXCTW RDCDLQ JQMNKXTMB
____o_e_ a _a_ _o _e_a__ __.
PTBMYEQL K FKH CY LQZKTL TC.

Full decipherment:
auto repairmen to customer: if
KDCY LQZKTLJQX CY MDBCYJQL: TR
you want, we can freeze your
HYD FKXC, FQ MKX RLQQIQ HYDL
car until future mechanics
MKL DXCTW RDCDLQ JQMNKXTMB
discover a way to repair it.
PTBMYEQL K FKH CY LQZKTL TC.
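The letter counts and the final key can be reproduced mechanically. The sketch below counts ciphertext letters and then applies the key read off the solved example; it shows the bookkeeping only, not how a solver would discover the key, and the variable names are assumptions.

```python
from collections import Counter

CIPHERTEXT = (
    "KDCY LQZKTLJQX CY MDBCYJQL: TR "
    "HYD FKXC, FQ MKX RLQQIQ HYDL "
    "MKL DXCTW RDCDLQ JQMNKXTMB "
    "PTBMYEQL K FKH CY LQZKTL TC."
)

# Frequency table: the most frequent ciphertext letters are vowel candidates.
counts = Counter(ch for ch in CIPHERTEXT if ch.isalpha())
for letter, n in counts.most_common(5):
    print(letter, n)

# Key read off the solved example: ciphertext letter -> plaintext letter.
KEY = str.maketrans("KDCYLQZTJXMBRHFIWNPE", "autorepimncsfywzlhdv")
plaintext = CIPHERTEXT.translate(KEY)
print(plaintext)
```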
Letter Substitution Cipher
How little knowledge of the plaintext language is necessary for decipherment?
Simple letter-based n-gram models
P(a | t) -- given t, chance that next letter is a
EM-based decipherment
[Knight et al 06]
Integer-programming-based decipherment
[Ravi & Knight 08]
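To make the letter n-gram idea concrete, here is a small Python sketch of a letter bigram model P(next | previous). The add-one smoothing and the toy training string are assumptions made for illustration; the cited papers use larger models and EM or integer programming, neither of which is implemented here.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + " "

def train_bigram(text):
    """Return a function giving log P(next | prev) with add-one smoothing."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])
    vocab = len(ALPHABET)

    def logp(nxt, prev):
        return math.log((pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + vocab))

    return logp

def score(candidate, logp):
    """Log probability of a candidate plaintext under the bigram model."""
    return sum(logp(b, a) for a, b in zip(candidate, candidate[1:]))

# Toy training text (hypothetical; a real model would be trained on a large corpus).
logp = train_bigram("the cat sat on the mat and the hat")

# A decipherment key is judged by how plausible the plaintext it yields looks.
print(score("the hat", logp) > score("hte aht", logp))
```

A key search would propose candidate keys, decode the ciphertext with each, and prefer keys whose output scores well under such a model.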
Letter Substitution Cipher
[Ravi & Knight 08]
3-gram letter-based language model of English used for decipherment
Unknown Script as a Cipher
ciphertext (Linear B tablet) -> ? -> Greek sounds
ciphertext (Mayan writing) -> ? -> Modern Mayan sounds
make the text speak
Unknown Script as a Cipher
ciphertext (6980 letters)
primera parte
del ingenioso
hidalgo don
(Don Quixote)
Modern Spanish sounds
?
26 sounds:
B, D, G, J (canyon), L (yarn), T (thin), a, b, d, e, f, g, i, k, l, m, n, o, p, r, rr (trilled), s, t, tS, u, x (hat)
32 letters:
á, é, í, ó, ú, ñ, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
?
[Knight & Yamada, 1999]
?
Unknown Script as a Cipher
ciphertext (6980 letters)
primera parte
del ingenioso
hidalgo don
(Don Quixote)
Modern Spanish sounds
Phoneme-to-letter model: P(c | p) = P(c1 | p1) * P(c2 | p2) * P(c3 | p3) * ...
    e.g. P(y | L) = 0.8 ?
Phoneme bigram model: P(p) = P(p1 | START) * P(p2 | p1) * P(p3 | p2) * ...
    e.g. P(L | tS) = 0.003
?
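One way to read the two models together, under the simplifying assumption (taken from the products on the slide) that each phoneme p_i produces exactly one letter c_i and that p_0 = START: the probability of the observed ciphertext sums over all hidden phoneme sequences, and training looks for the P(c | p) table that makes this quantity large.

```latex
P(c) = \sum_{p} P(p)\, P(c \mid p),
\qquad
P(p) = \prod_{i=1}^{n} P(p_i \mid p_{i-1}),
\qquad
P(c \mid p) = \prod_{i=1}^{n} P(c_i \mid p_i)
```

A fuller model would also need to let a sound be spelled with two letters or with none, which this one-to-one sketch does not cover.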
Ideal Key
sound                 letter
B                     b or v
D                     d
G                     g
J                     ñ
L                     ll or y
a                     a or á
b                     b or v
d                     d
e                     e or é
f                     f
g                     g
i                     i or í
l                     l
m                     m
n                     n
o                     o or ó
p                     p
r                     r
t                     t
tS                    ch
u                     u or ú
x                     j
nothing               h
T (before a, o, u)    z
T (before e or i)     c or z
T (otherwise)         c
k (before e or i)     qu
k (before s)          x
k (otherwise)         c
rr (start of word)    r
rr (otherwise)        rr
s (after k)           nothing
s (otherwise)         s
Unknown Script as a Cipher
ciphertext (6980 letters)
primera parte
del ingenioso
hidalgo don
(Don Quixote)
Modern Spanish sounds
?
EM-based decipherment finds a very good key and achieves 93% phoneme accuracy
Correct sounds: primera parte del inxenioso iDalGo don kixote
Deciphered sounds: primera parte del inGenioso biDalGo don kixote
How to Decipher Unknown Script if Spoken Language is Also Unknown?
One idea: build a universal model P(p) of human phoneme sequence production
Human might generally say: K AH N AH R IY
Human won't generally say: R T R K L K
Deciphering means finding a P(c | p) table such that there is a decoding with a good universal P(p) score
Universal Phonology
Linguists know lots of stuff!
Phoneme inventory
    if z, then s
Syllable inventory
    all languages have CV (consonant-vowel) syllables
    if VCC, then also VC
Syllable sonority structure
    {stdbptk} {mnrl} {V} {mnrl} {stdbptk}
    dram, lomp, tra, ma, ? rdam, ? lopm, ? tba, ? mla
Physiological preference constraints
    tomp, tont, tongk, ? tomk, ? tonk, ? tongt, ? tonp
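The sonority template can be checked mechanically. The Python sketch below treats every slot except the vowel as optional and uses the five orthographic vowels as a stand-in for the class {V}; both choices are assumptions made for illustration, not something the slide specifies.

```python
import re

# Obstruent? + sonorant? + vowel(s) + sonorant? + obstruent?
# Assumption: the letters a, e, i, o, u stand in for the vowel class {V}.
TEMPLATE = re.compile(r"[stdbpk]?[mnrl]?[aeiou]+[mnrl]?[stdbpk]?")

def fits_template(syllable: str) -> bool:
    """True if the whole syllable matches the sonority template."""
    return TEMPLATE.fullmatch(syllable) is not None

for syllable in ["dram", "lomp", "tra", "ma", "rdam", "lopm", "tba", "mla"]:
    print(syllable, fits_template(syllable))
```

The four forms marked with "?" on the slide are the ones this template rejects, since each puts a less sonorous consonant closer to the vowel than a more sonorous one, or stacks two consonants of the same class.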
Universal Phonology
primera parte
del ingenioso
hidalgo don
human-sounding sequence
??
Task 1: Label each letter with a phoneme
syllable type sequence
consonant/vowel sequences
# of syllables in word
Universal Phonology
primera parte
del ingenioso
hidalgo don
Task 2: Label each letter with a phoneme class: C or V
Input: primera parte del ingenioso hidalgo don
Output: CCVCVCV CVCCV CVC VCCVCVVCV CVCVCCV CVC
P(1) = ?   P(2) = ?   etc.
P(CV) = ?   P(V) = ?   P(CVC) = ?   + 7 other types
P(V | V) = ?   P(VV | V) = ?
P(a | V) = ?   P(a | C) = ?   etc.
P(CV) = 0.45 P(VC) = 0.09
P(V) = 0.15 P(CVC) = 0.22
P(CCV) = 0.02 P(CCVC) = 0.01
P(a | V) = 0.27 P(a | C) = 0.00
P(b | V) = 0.00 P(b | C) = 0.04
P(c | V) = 0.00 P(c | C) = 0.07
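For reference, the target labelling in Task 2 can be reproduced with a trivial supervised baseline. The Python sketch below assumes the vowel letters are already known, which is exactly the knowledge the unsupervised model has to discover on its own; it shows only what a correct output looks like, not how the learned probabilities above are obtained.

```python
VOWELS = set("aeiou")  # assumption: known in advance, unlike in the unsupervised setting

def cv_label(text: str) -> str:
    """Label each letter C or V; spaces and other characters pass through unchanged."""
    labels = []
    for ch in text.lower():
        if ch in VOWELS:
            labels.append("V")
        elif ch.isalpha():
            labels.append("C")
        else:
            labels.append(ch)
    return "".join(labels)

print(cv_label("primera parte del ingenioso hidalgo don"))
# CCVCVCV CVCCV CVC VCCVCVVCV CVCVCCV CVC
```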
Another idea: brute force
If we don't know the spoken language, simply decode against all spoken languages:
Pre-collect P(p) for 300 languages
Train a P(c | p) using each P(p) in turn
See which decoding run assigns highest P(c)
Hard to get phoneme sequences
Can use text sequence as a substitute
Unknown Source Language
UN Declaration of Human Rights
No one shall be arbitrarily deprived of his property
Niemand se eiendom sal arbitrr afgeneem word nie
Asnjeri nuk duhet t privohet arbitrarisht nga pasuria e tij
Janiw khitisa utaps oraqeps inaki aparkaspati
Arrazoirik gabe ez zaio inori bere jabegoa kenduko
Den ebet ne vo tennet e berc'hentiezh diganta diouzh c'hoant
H
Ning no ser privat arbitrriament de la seva propietat
Di a so prupiit n ni p essa privu nimu di modu tirannicu
Nitko ne smije samovoljno biti lien svoje imovine
Nikdo nesm bt svvoln zbaven svho majetku
Ingen m vilkrligt berves sin ejendom
Niemand mag willekeurig van zijn eigendom worden beroofd
Nul ne peut tre arbitrairement priv de sa proprit
Nimmenmei samar fan syn eigendom berve wurde
Ningun ser privado arbitrariamente da sa propiedade
Niemand darf willkrlich seines Eigentums beraubt werden
Avavgui ndojepe'a va'eri oimehicha reinte imbe teva
Ba wanda za a kwace wa dukiyarsa ba tare da cikakken dalili ba
Senkit sem lehet tulajdontl nknyesen megfosztani
Engan m eftir getta svipta eign sinni
Necuno essera private arbitrarimente de su proprietate
N fidir a mhaoin a bhaint go forlmhach de dhuine ar bith
Al neniu estu arbitre forprenita lia proprieto
Kelleltki ei tohi tema vara meelevaldselt ra vtta
Eingin skal hissini vera fyri ongartku
Me kua ni dua e kovei vua na nona iyau
Keltn lkn mielivaltaisesti riistettk hnen omaisuuttaan
Exists in many of the world's languages, UTF-8 encoding
Unknown Source Language
Input: cevzren cnegr qry vatravbfb uvqnytb qba dhvwbgr qr yn znapun
Top 5 languages with best P(c) after deciphering:
-5.29120 spanish
-5.43346 galician
-5.44087 portuguese
-5.48023 kurdish
-5.49751 romanian
Best-path decoding assuming plaintext is Spanish: primera parte del ingenioso hidalgo don quijote de la mancha
Best-path decoding assuming plaintext is English: wizaris asive bek u-gedundl pubscon bly whualve be ks asequs
Simultaneous language ID and decipherment
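As a sanity check on this example, the Input line, read as ten words, appears to be a plain letter substitution (ROT13) of the Spanish text. The slides do not name the cipher, so treat that as an observation rather than a documented fact. The Python sketch below only inverts that substitution with the standard library; it is not the brute-force language search itself.

```python
import codecs

# Input line from the slide, read as ten space-separated words.
ciphertext = "cevzren cnegr qry vatravbfb uvqnytb qba dhvwbgr qr yn znapun"

# rot_13 is a text-to-text codec in the Python standard library.
plaintext = codecs.decode(ciphertext, "rot_13")
print(plaintext)  # primera parte del ingenioso hidalgo don quijote de la mancha
```

The decipherment system is not told any of this: it has to recover both the key and the language from the ciphertext alone.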
Transliteration as a Cipher
Ciphertext: Japanese Katakana
Plaintext: English
[Ravi & Knight 09]
Foreign Language as a Cipher?
Ciphertext: Billions of words of Albanian
Plaintext: English
Is it possible to train statistical MT systems with little or no parallel text?
What's left to decipher?
Proto-Elamite
Linear A
Etruscan
rongorongo
Indus Valley
Phaistos disk
Epi-Olmec and other Mesoamerican scripts
Linear A (Crete, ca. 2000 BC to 1200 BC)
Clearly the precursor of Linear B
Mostly accounting texts (like Linear B), though there are other kinds of inscriptions
We can read the texts but we don't know much about the underlying language.
Etruscan (Italy, 700 BC to 1st century AD)
The alphabet is known: it was derived from Greek and was the precursor to Latin
The language (like that of Linear A) is largely unknown
Proto-Elamite (Iran, ca. 3100 to 2900 BC)
Possibly as many as 5,500 distinct signs (?)
Underlying language is unknown; it may be Elamite (cf. later linear Elamite inscriptions), but that is not clear
rongorongo (Easter Island, 19th century)
About 600 zoomorphic and anthropomorphic glyphs
Extant corpus is about 12,000 glyphs long, all carved on driftwood
The underlying language (Rapanui) is known
Ethnographic accounts of the rongorongo ceremonies exist
Claims to the contrary aside, there is no evidence this was a writing system in the normal sense.
The only bit of text that has been deciphered is a calendar
Indus Valley (South Asia, 26th to 20th century BC)
System with a few hundred glyphs
Inscriptions are very short: the longest on a single surface has 17 glyphs
The standard theory, due to Asko Parpola, is that this was a Dravidian language
Recently, Farmer, Witzel and Sproat argued that this was not a writing system at all
Phaistos disk (Crete, ca. 1800 BC??)
241 tokens with 45 distinct glyphs
Glyphs are all pictographic: images of animals, people, various objects
Text is on both sides of disk in a spiral working from the outside
The Phaistos Disk is the world's first known printed document
There has been a recent suggestion (by ancient art dealer Jerome Eisenberg) that it may be a fake
In any case, the text is too short to allow for a verifiable decipherment
Further Reading
Encoding: there are many documents on the web that discuss encoding issues, including various documents from the Unicode Consortium.
However, one of the best starting places is: http://www.joelonsoftware.com/articles/Unicode.html
Further Reading
Transliteration/Transcription
http://www.cs.mu.oz.au/~skarimi/
S. Karimi, F. Scholer, A. Turpin, A Survey on Machine Transliteration Literature, (Submitted Dec 08, Review received 31 Mar 09) Under Revision for ACM Computing Surveys.
Further Reading
Discriminative models of transcription
1. A. Klementiev and D. Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In ACL.
2. D. Zelenko and C. Aone. 2006. Discriminative methods for transliteration. In EMNLP.
3. S-Y. Yoon, K-Y. Kim, and R. Sproat. 2007. Multilingual transliteration using feature based phonetic method. In ACL.
4. D. Goldwasser and D. Roth. 2008. Active sample selection for named entity transliteration. In ACL.
Further Reading
Decipherment
1. R. Parkinson. 1999. Cracking Codes: The Rosetta Stone and Decipherment. University of California Press, Berkeley.
2. M. Pope. 1999. The Story of Decipherment: From Egyptian Hieroglyphs to Maya Script. Thames and Hudson, London.
3. A. Robinson. 2002. The Man who Deciphered Linear B: The Story of Michael Ventris. Thames and Hudson, London.
4. A. Robinson. 2009. Lost Languages: The Enigma of the World's Undeciphered Scripts. Thames and Hudson, London.
Further Reading
Auto Decipherment
1. A Computational Approach to Deciphering Unknown Scripts, (K. Knight and K. Yamada), Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, 1999.
2. Unsupervised Analysis for Decipherment Problems, (K. Knight, A. Nair, N. Rathod, and K. Yamada), Proc. ACL-COLING (poster), 2006.
3. Attacking Decipherment Problems Optimally with Low-Order N-gram Models, (S. Ravi and K. Knight), Proc. EMNLP, 2008.
4. Learning Phoneme Mappings for Transliteration without Parallel Data, (S. Ravi and K. Knight), Proc. NAACL, 2009.
the end