
Page 1: Lost Language Decipherment

Kovid Kapoor – 08005037
Aashimi Bhatia – 08D04008
Ravinder Singh – 08005018
Shaunak Chhaparia – 07005019

Lost Language Decipherment

Page 2: Lost Language Decipherment

Examples of ancient languages that were lost
Motivation: why should we care about such languages?
The manual process of decipherment
Motivation for a computational model
A statistical method for decipherment
Conclusions

Outline

Page 3: Lost Language Decipherment

A language is said to be “lost” when modern scholars cannot reconstruct text written in it.
This is slightly different from a “dead” language – one that people can still translate to and from, but that no one uses in everyday life anymore.
A language is generally lost when it is replaced by another.
For example, many Native American languages were replaced by English, Spanish, etc.

What is a “lost” language?

Page 4: Lost Language Decipherment

Egyptian Hieroglyphs
A formal writing system used by the ancient Egyptians, consisting of logographic and alphabetic symbols.
Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone.

Ugaritic Language
Tablets with engravings found in the lost city of Ugarit, Syria.
Researchers recognized that the language is related to Hebrew, and could identify some parallel words.

Examples of Lost Languages

Page 5: Lost Language Decipherment

Indus Script
Written in and around present-day Pakistan, around 2500 BC.
Over 4,000 samples of the text have been found.
Still not deciphered successfully!
What makes it so difficult to decipher?

Examples of Lost Languages (cont.)

http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg

Page 6: Lost Language Decipherment

Historical knowledge expansion
Very helpful in learning about the history of the place where the language was written.
Alternate sources of information exist: coins, drawings, buried tombs.
These sources are not as precise as reading the literature of the region, which gives a much clearer picture.

Learning about the past explains the present
A lot of the culture of a place is derived from ancient cultures.
Boosts our understanding of our own culture.

Motivation for Decipherment of Lost Languages

Page 7: Lost Language Decipherment

From a linguistic point of view
We can figure out how certain languages developed over time.
The origin of some words can be explained.

Motivation for Decipherment of Lost Languages (cont.)

Page 8: Lost Language Decipherment

Similar to a cryptographic decryption process; frequency-analysis-based techniques are used.
The first step is to identify the writing system: logographic, alphabetic, or a syllabary?
Usually determined by the number of distinct symbols (see the sketch below).
The next step is to identify whether there is a closely related known language.
Researchers also hope to find bitexts: translations of a text of the language into a known language, like Latin, Hebrew, etc.

The Manual Process

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
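
The symbol-counting heuristic above is easy to make concrete. Below is a minimal Python sketch; the thresholds are illustrative assumptions of ours (real scripts blur these boundaries), not figures from the slides.

```python
from collections import Counter

def guess_writing_system(symbols):
    """Guess the writing system from the number of distinct symbols,
    as in the first step of the manual process. Thresholds are rough,
    illustrative assumptions."""
    n_distinct = len(set(symbols))
    if n_distinct < 40:
        return "alphabetic"    # e.g. Ugaritic, with 30 symbols
    if n_distinct < 120:
        return "syllabary"     # syllabaries typically have 50-100 signs
    return "logographic"       # logographic scripts run into the hundreds

# Toy usage on a transliterated sample with few distinct symbols:
sample = list("abcbadbcaabdcdab")
print(guess_writing_system(sample))   # -> alphabetic
```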

Page 9: Lost Language Decipherment

The earliest attempt was made by Horapollo in the 5th century.
However, his explanations were mostly wrong, and proved to be an impediment to the process for 1,000 years!
Arab historians were able to partly decipher the script in the 9th and 10th centuries.
Major breakthrough: the discovery of the Rosetta Stone by Napoleon’s troops.

Examples of Manual Decipherment: Egyptian Hieroglyphs

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Page 10: Lost Language Decipheration

The stone has a decree issued by the king in three languages : hieroglyphs, demotic, and ancient Greek!

Finally deciphered in 1820 by  Jean-François Champollion.

Note that even with the availability of a bitext, full decipheration took 20 more years!

Examples of Manual Decipherment : Egyptian Hieroglyphs

http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg

Page 11: Lost Language Decipherment

The inscribed words consisted of only 30 distinct symbols, so the script was very likely alphabetic.
The location where the tablets were found suggested that the language is closely related to the Semitic languages.
Some words in Ugaritic had the same origin as words in Hebrew.
For example, the Ugaritic word for “king” is the same as the Hebrew word.

Examples of Manual Decipherment: Ugaritic

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Page 12: Lost Language Decipherment

Lucky discovery: Hans Bauer guessed that the writing on an axe that had been found was the word “axe”!

This led to the revision of some earlier hypotheses, and resulted in the decipherment of the entire script!

Examples of Manual Decipherment: Ugaritic (cont.)

http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg

Page 13: Lost Language Decipherment

A very time-consuming exercise; successful decipherment has taken years, even centuries.

Even once some basic information about the language is known – such as its syntactic structure or a closely related language – a long time is required to produce character and word mappings.

Conclusions on the Manual Process

Page 14: Lost Language Decipherment

Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings?

Can the knowledge of a closely related language be used to decipher a lost language?

If possible, this would save a lot of effort and time.

“Successful archaeological decipherment has turned out to require a synthesis of logic and intuition … that computers do not (and presumably cannot) possess.” – Andrew Robinson

Need for a Computerised Model

Page 15: Lost Language Decipherment

Notice that manual efforts share some guiding principles:
A common starting point is to compare letter and word frequencies with a known language (see the sketch below).
Morphological analysis plays a crucial role as well.
Highly frequent morpheme correspondences can be particularly revealing.
The model tries to capture these letter/word-level mappings and morpheme correspondences.

Recent Attempts: A Statistical Model

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
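
As a concrete illustration of the frequency-comparison starting point, here is a minimal sketch that pairs symbols by frequency rank. This naive rank matching is only a crude baseline for intuition, not the statistical model described next; the function names and corpora are our own.

```python
from collections import Counter

def frequency_ranking(words):
    """Characters of a corpus, sorted from most to least frequent."""
    counts = Counter(ch for w in words for ch in w)
    return [ch for ch, _ in counts.most_common()]

def naive_character_mapping(lost_words, known_words):
    """Pair the i-th most frequent lost symbol with the i-th most
    frequent known letter -- a crude, purely illustrative baseline."""
    return dict(zip(frequency_ranking(lost_words),
                    frequency_ranking(known_words)))

# Toy usage on the numeral language from the later toy example:
print(naive_character_mapping(["15234", "1525", "4352"],
                              ["asked", "asks", "desk"]))
# -> {'5': 's', '2': 'a', '1': 'k', '3': 'e', '4': 'd'}
#    5, 3 and 4 come out right; 1 and 2 end up swapped, showing why
#    frequency alone is only a starting point.
```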

Page 16: Lost Language Decipherment

We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family.

Our primary goals:
Find the mapping between the alphabets of the lost and the known language.
Translate words in the lost language into corresponding cognates in the known language.

Problem Formulation

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 17: Lost Language Decipherment

We make several assumptions in this model:

The writing system is alphabetic in nature.
This can be easily verified by counting the number of distinct symbols in the found records.

The corpus has been transcribed into an electronic format.
This means that each character is uniquely identified.

About the morphology of the language: each word consists of a stem, a prefix, and a suffix, where the latter two may be omitted (see the segmentation sketch below).
This assumption holds for a large variety of human languages.

Problem Formulation
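
To make the morphological assumption concrete, here is a small sketch enumerating the (prefix, stem, suffix) splits a model under this assumption would consider for a word; the function name is ours, purely for illustration.

```python
def segmentations(word):
    """All (prefix, stem, suffix) splits of a word, where prefix and
    suffix may be empty but the stem may not -- the morphological
    assumption stated above."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n)                 # end of prefix
            for j in range(i + 1, n + 1)]     # end of (non-empty) stem

# e.g. "asked" admits ('', 'ask', 'ed') and ('', 'asked', '') among others
print(('', 'ask', 'ed') in segmentations("asked"))   # True
```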

Page 18: Lost Language Decipherment

The morpheme inventories and their frequencies in the known language are given.

In essence, the input consists of two parts:
A list of unanalyzed words in the lost language.
A morphologically analyzed lexicon in a known related language.

Problem Formulation

Page 19: Lost Language Decipherment

Consider the following example, consisting of words in a lost language closely related to English, but written using numerals:
15234 – asked
1525 – asks
4352 – desk

Notice the pair of endings, -34 and -5, attached to the same initial sequence 152-.
These might correspond to -ed and -s respectively.
Thus 3 = e, 4 = d, and 5 = s.

Intuition: A Toy Example
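
A small sketch of the ending-pairing heuristic just described: scan for word pairs sharing a long initial sequence and collect their differing endings as candidate suffix correspondences. The function name and the minimum-stem threshold are our own illustrative choices.

```python
def candidate_suffix_pairs(words, min_stem=3):
    """Find word pairs sharing an initial sequence of at least
    `min_stem` symbols; return (stem, ending_a, ending_b) triples."""
    triples = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            k = 0                      # length of common initial sequence
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            if k >= min_stem:
                triples.append((a[:k], a[k:], b[k:]))
    return triples

print(candidate_suffix_pairs(["15234", "1525", "4352"]))
# -> [('152', '34', '5')]: the shared stem 152- with endings -34 and -5,
#    which we hypothesize correspond to English -ed and -s
```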

Page 20: Lost Language Decipherment

Now we can say that 435 = des, and using our knowledge of English, we can suppose that this word is very likely to be “desk”.

As this example illustrates, we proceed by discovering both character- and morpheme-level mappings.

Another intuition the model should capture is the sparsity of the mapping:
A correct mapping will preserve the phonetic relations between the two related languages.
Each character in the lost language will therefore map to only a small number of characters in the related language.

Intuition: A Toy Example

Page 21: Lost Language Decipheration

We assume that each morpheme is probabilistically generated jointly with a latent counterpart in the lost language

The challenge: Each level of correspondence can completely describe the observed data. So using a mechanism based on one leaves no room for the other.

The solution: Using a Dirichlet Process to model probabilities (explained further).

Model Structure

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 22: Lost Language Decipherment

There are four basic layers in the generative process:
Structural sparsity
Character-edit distribution
Morpheme-pair distributions
Word generation

Model Structure (cont.)

Page 23: Lost Language Decipherment

Model Structure (cont.)

Graphical overview of the model:
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 24: Lost Language Decipherment

We need control over the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse.

The set of edit operations includes character substitutions, insertions, and deletions. We assign a binary variable λe to every edit operation e.

The set of character correspondences whose variable is set to 1, { (u,h) : λ(u,h) = 1 }, conveys a set of phonetically valid correspondences.

We define a joint prior over these variables to encourage sparse character mappings.

Step 1: Structural Sparsity

Page 25: Lost Language Decipherment

This prior can be viewed as a distribution over binary matrices, and is defined to encourage every row and column to sum to a low integer value (typically 1).

For a given matrix, define a count c(u): the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑h λ(u,h).

We now define the features fi = max(0, |{u : c(u) = i}| − bi).
For any i other than 1, fi should be as low as possible.

The probability of the matrix is then given by
P(λ) = (1/Z) · exp( ∑i wi fi )

Step 1: Structural Sparsity (cont.)

Page 26: Lost Language Decipherment

Here Z is the normalization factor and w is the weight vector.

Each wi is either zero or negative, ensuring that the probability is high when the corresponding fi is low.

The values of bi and wi can be adjusted depending on the number of characters in the lost language and in the related language.

Step 1: Structural Sparsity (cont.)
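
A minimal sketch of evaluating this prior on a candidate mapping matrix. The b and w values below are illustrative assumptions, and the normalizer Z is omitted (only the unnormalized log-probability is computed).

```python
def sparsity_log_score(matrix, b, w):
    """Unnormalized log P(lambda) for a binary mapping matrix:
    c(u) = row sum = number of known letters mapped to lost letter u,
    f_i  = max(0, |{u : c(u) = i}| - b_i),
    log P = sum_i w_i * f_i  (up to the constant -log Z)."""
    counts = [sum(row) for row in matrix]          # c(u) for each row u
    score = 0.0
    for i, w_i in w.items():
        f_i = max(0, sum(1 for c in counts if c == i) - b.get(i, 0))
        score += w_i * f_i
    return score

# Toy 3x3 matrix: two 1-to-1 mappings, one letter mapping to two letters.
m = [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 1]]
b = {0: 0, 1: 3, 2: 1}            # budgets: tolerate one doubly-mapped letter
w = {0: -10.0, 1: 0.0, 2: -2.0}   # penalize unmapped / doubly-mapped letters
print(sparsity_log_score(m, b, w))   # -> 0.0, i.e. within budget
```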

Page 27: Lost Language Decipherment

We now draw a base distribution G0 over character-edit sequences.

The probability P(e) of a given edit sequence depends on the indicator variables λe of its individual edit operations, and on a factor q(#ins(e), #del(e)) that depends on the number of insertions and deletions in the sequence.

The insertion/deletion factor q is chosen according to the average word lengths of the lost language and the related language.

Step 2: Character-Edit Distribution

Page 28: Lost Language Decipheration

Example: Average Ugaritic word is 2 letters longer than an average Herbew word

Therefore, we set our q to be such as to disallow any deletions and allow 1 insertion per sequence, with the probability of 0.4

The part depending on the λes makes the distribution spike at 0 if the value is 0 and keeps it unconstrained otherwise (spike-and slab priors)

Step 2 :Character-Edit Distribution (cont.)
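
The following sketch scores a single edit sequence under the base distribution just described: the λ indicators zero out disallowed substitutions, and q penalizes insertions and deletions. The edit-sequence encoding and function names are our own assumptions.

```python
def edit_sequence_weight(edits, lam, q):
    """Unnormalized G0 weight of a character-edit sequence.
    `edits` holds tuples ('sub', u, h), ('ins', h) or ('del', u);
    `lam` maps (u, h) substitutions to their 0/1 indicator."""
    n_ins = sum(1 for op in edits if op[0] == 'ins')
    n_del = sum(1 for op in edits if op[0] == 'del')
    weight = q(n_ins, n_del)
    for op in edits:
        if op[0] == 'sub':
            weight *= lam.get((op[1], op[2]), 0)   # spike at 0 if disallowed
    return weight

def q(n_ins, n_del):
    """Mirrors the Ugaritic/Hebrew setting above: no deletions,
    at most one insertion per sequence, with probability 0.4."""
    if n_del > 0 or n_ins > 1:
        return 0.0
    return 0.4 if n_ins == 1 else 0.6

lam = {('u1', 'h1'): 1, ('u2', 'h2'): 1}
print(edit_sequence_weight([('sub', 'u1', 'h1'), ('ins', 'h3')], lam, q))  # 0.4
```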

Page 29: Lost Language Decipherment

The base distribution G0, together with a fixed concentration parameter α, defines a Dirichlet process, which provides a probability distribution over morpheme-pair distributions.

The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme pairs, while remaining sensitive to the character-level probabilities of the base distribution.

Our model distinguishes between the three kinds of morphemes – prefixes, stems, and suffixes – and therefore uses a different value of α for each.

Step 3: Morpheme-Pair Distributions

Page 30: Lost Language Decipheration

Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for the stem, we maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech.

Step 3 : Morpheme Pair-Distributions (cont.)
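
One standard way to sample from a Dirichlet process is the Chinese restaurant process, sketched below: frequently generated morpheme pairs are reused with probability proportional to their counts, while fresh pairs are drawn from G0 with probability proportional to α. This reproduces the skewed, "rich get richer" behavior described above; the CRP view is standard DP machinery, not a detail given in the slides.

```python
import random

def crp_draw(counts, alpha, sample_from_g0):
    """Draw a morpheme pair from a DP(alpha, G0) distribution via the
    Chinese restaurant process, updating `counts` in place."""
    total = sum(counts.values()) + alpha
    r = random.uniform(0, total)
    for pair, c in counts.items():
        r -= c
        if r <= 0:                       # reuse an existing pair
            counts[pair] += 1
            return pair
    pair = sample_from_g0()              # otherwise, a fresh pair from G0
    counts[pair] = counts.get(pair, 0) + 1
    return pair

# A frequent (hypothetical) Ugaritic/Hebrew stem pair is likely to be reused:
counts = {("mlk", "mlk"): 5}
print(crp_draw(counts, alpha=1.0, sample_from_g0=lambda: ("new", "new")))
```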

Page 31: Lost Language Decipherment

Once the morpheme-pair distributions have been drawn, actual word pairs may be generated.

Based on some prior, we first decide whether a word in the lost language has a cognate in the known language.

If it does, a cognate word pair (u, h) is produced; otherwise, a lone word u is generated.

Step 4: Word Generation
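
A generative sketch of this final step, under our reading of the slides: a coin flip decides whether the lost word has a cognate; if so, prefix/stem/suffix pairs are drawn (prefix and suffix conditioned on the stem's part of speech) and concatenated. All the sampler arguments are assumptions standing in for the distributions of the previous steps.

```python
import random

def generate_word(p_cognate, draw_stem, draw_prefix, draw_suffix, draw_lone):
    """Return (u, h): a lost-language word u and its known-language
    cognate h, or (u, None) when the word has no cognate."""
    if random.random() >= p_cognate:
        return (draw_lone(), None)          # lone lost-language word
    (u_stm, h_stm), pos = draw_stem()       # stem pair plus its POS tag
    u_pre, h_pre = draw_prefix(pos)         # prefix pair (possibly empty)
    u_suf, h_suf = draw_suffix(pos)         # suffix pair (possibly empty)
    return (u_pre + u_stm + u_suf, h_pre + h_stm + h_suf)
```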

Page 32: Lost Language Decipherment

This model captures both character- and lexical-level correspondences, while utilizing morphological knowledge of the known language.

An additional feature of this multi-layered structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0.

As a result, any character-level mapping learned from one correspondence is propagated to the other morpheme distributions.

The character-level mappings also obey the sparsity constraints.

Summarizing the Model

Page 33: Lost Language Decipherment

The model was applied to the Ugaritic language.
The undeciphered corpus contains 7,386 unique word types.
The Hebrew Bible was used as the known-language corpus; Hebrew is close to ancient Ugaritic.
Morphological and POS annotations are assumed to be available for the Hebrew lexicon.

Results of the Process

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 34: Lost Language Decipherment

The method identifies Hebrew cognates for 2,155 words, covering almost one third of the Ugaritic vocabulary.

The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and correctly translates only 29% of all cognates.

This method yields a correct mapping for 29 out of 30 characters, and correctly translates 60.4% of all cognates.

Results of the Process

Page 35: Lost Language Decipherment

Even with correct character mappings, many words can be translated correctly only by examining their context.

The model currently fails to take this contextual information into account.

Future Work

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 36: Lost Language Decipherment

We saw how language decipherment is an extremely complex task.

Years of effort are required for the successful decipherment of each lost language.

Success depends on the amount of corpus available in the unknown language; but availability alone does not make the task easy.

The statistical model has shown promise, and can be developed further and applied to more languages.

Conclusions

Page 38: Lost Language Decipherment

B. Snyder, R. Barzilay, and K. Knight – A Statistical Model for Lost Language Decipherment (ACL 2010). http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Straight Dope Science Advisory Board staff report – How come we can’t decipher the Indus script? (2005). http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Wade Davis on Endangered Cultures (2008). http://www.ted.com/talks/wade_davis_on_endangered_cultures.html

References