
Page 1: Lost Language Decipherment

Kovid Kapoor – 08005037
Aashimi Bhatia – 08D04008
Ravinder Singh – 08005018
Shaunak Chhaparia – 07005019

Lost Language Decipherment

Page 2: Lost Language Decipherment

Examples of ancient languages that were lost
Motivation: why should we care about such languages?
The manual process of decipherment
Motivation for a computational model
A statistical method for decipherment
Conclusions

Outline

Page 3: Lost Language Decipherment

A language is said to be “lost” when modern scholars cannot reconstruct text written in it.
This is slightly different from a “dead” language – one that people can still translate to and from, but that no one uses in everyday life anymore.
A language is generally lost when it is replaced by another.
For example, many Native American languages were replaced by English, Spanish, etc.

What is a “lost” language?

Page 4: Lost Language Decipherment

Egyptian Hieroglyphs
A formal writing system used by the ancient Egyptians, consisting of logographic and alphabetic symbols.
Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone.

Ugaritic Language
Tablets with engravings found in the lost city of Ugarit, Syria.
Researchers recognized that the language is related to Hebrew, and could identify some parallel words.

Examples of Lost Languages

Page 5: Lost Language Decipherment

Indus Script
Written in and around present-day Pakistan, around 2500 BC.
Over 4,000 samples of the text have been found.
Still not deciphered successfully!
What makes it so difficult to decipher?

Examples of Lost Languages (cont.)

http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg

Page 6: Lost Language Decipherment

Historical knowledge expansion
Very helpful in learning about the history of the place where the language was written.
Alternate sources of information exist: coins, drawings, buried tombs.
These sources are not as precise as reading the literature of the region, which gives a much clearer picture.

Learning about the past explains the present
A lot of the culture of a place is derived from ancient cultures.
Boosts our understanding of our own culture.

Motivation for Decipherment of Lost Languages

Page 7: Lost Language Decipherment

From a linguistic point of view
We can figure out how certain languages developed over time.
The origin of some words can be explained.

Motivation for Decipherment of Lost Languages (cont.)

Page 8: Lost Language Decipherment

Similar to a cryptographic decryption process; frequency-analysis-based techniques are used.
The first step is to identify the writing system: logographic, alphabetic, or a syllabary?
Usually determined by the number of distinct symbols (see the sketch below).
The next step is to identify whether there is a closely related known language.
Researchers also hope to find bitexts: translations of a text of the language into a known language, like Latin, Hebrew, etc.

The Manual Process

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
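
The symbol-counting heuristic above is easy to make concrete. Below is a minimal Python sketch; the thresholds are illustrative assumptions of ours (real scripts blur these boundaries), not figures from the slides.

```python
from collections import Counter

def guess_writing_system(symbols):
    """Guess the writing system from the number of distinct symbols,
    as in the first step of the manual process. Thresholds are rough,
    illustrative assumptions."""
    n_distinct = len(set(symbols))
    if n_distinct < 40:
        return "alphabetic"    # e.g. Ugaritic, with 30 symbols
    if n_distinct < 120:
        return "syllabary"     # syllabaries typically have 50-100 signs
    return "logographic"       # logographic scripts run into the hundreds

# Toy usage on a transliterated sample with few distinct symbols:
sample = list("abcbadbcaabdcdab")
print(guess_writing_system(sample))   # -> alphabetic
```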

Page 9: Lost Language Decipherment

The earliest attempt was made by Horapollo in the 5th century.
However, his explanations were mostly wrong, and proved to be an impediment to the process for 1,000 years!
Arab historians were able to partly decipher the script in the 9th and 10th centuries.
Major breakthrough: the discovery of the Rosetta Stone by Napoleon’s troops.

Examples of Manual Decipherment: Egyptian Hieroglyphs

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Page 10: Lost Language Decipheration

The stone has a decree issued by the king in three languages : hieroglyphs, demotic, and ancient Greek!

Finally deciphered in 1820 by  Jean-François Champollion.

Note that even with the availability of a bitext, full decipheration took 20 more years!

Examples of Manual Decipherment : Egyptian Hieroglyphs

http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg

Page 11: Lost Language Decipherment

The inscribed words consisted of only 30 distinct symbols, so the script was very likely alphabetic.
The location where the tablets were found suggested that the language is closely related to the Semitic languages.
Some words in Ugaritic had the same origin as words in Hebrew.
For example, the Ugaritic word for “king” is the same as the Hebrew word.

Examples of Manual Decipherment: Ugaritic

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Page 12: Lost Language Decipherment

Lucky discovery: Hans Bauer guessed that the writing on an axe that had been found was the word “axe”!

This led to the revision of some earlier hypotheses, and resulted in the decipherment of the entire script!

Examples of Manual Decipherment: Ugaritic (cont.)

http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg

Page 13: Lost Language Decipherment

A very time-consuming exercise; successful decipherment has taken years, even centuries.

Even once some basic information about the language is known – such as its syntactic structure or a closely related language – a long time is required to produce character and word mappings.

Conclusions on the Manual Process

Page 14: Lost Language Decipherment

Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings?

Can the knowledge of a closely related language be used to decipher a lost language?

If possible, this would save a lot of effort and time.

“Successful archaeological decipherment has turned out to require a synthesis of logic and intuition … that computers do not (and presumably cannot) possess.” – Andrew Robinson

Need for a Computerised Model

Page 15: Lost Language Decipherment

Notice that manual efforts share some guiding principles:
A common starting point is to compare letter and word frequencies with a known language (see the sketch below).
Morphological analysis plays a crucial role as well.
Highly frequent morpheme correspondences can be particularly revealing.
The model tries to capture these letter/word-level mappings and morpheme correspondences.

Recent Attempts: A Statistical Model

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
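
As a concrete illustration of the frequency-comparison starting point, here is a minimal sketch that pairs symbols by frequency rank. This naive rank matching is only a crude baseline for intuition, not the statistical model described next; the function names and corpora are our own.

```python
from collections import Counter

def frequency_ranking(words):
    """Characters of a corpus, sorted from most to least frequent."""
    counts = Counter(ch for w in words for ch in w)
    return [ch for ch, _ in counts.most_common()]

def naive_character_mapping(lost_words, known_words):
    """Pair the i-th most frequent lost symbol with the i-th most
    frequent known letter -- a crude, purely illustrative baseline."""
    return dict(zip(frequency_ranking(lost_words),
                    frequency_ranking(known_words)))

# Toy usage on the numeral language from the later toy example:
print(naive_character_mapping(["15234", "1525", "4352"],
                              ["asked", "asks", "desk"]))
# -> {'5': 's', '2': 'a', '1': 'k', '3': 'e', '4': 'd'}
#    5, 3 and 4 come out right; 1 and 2 end up swapped, showing why
#    frequency alone is only a starting point.
```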

Page 16: Lost Language Decipherment

We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family.

Our primary goals:
Find the mapping between the alphabets of the lost and the known language.
Translate words in the lost language into corresponding cognates in the known language.

Problem Formulation

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 17: Lost Language Decipherment

We make several assumptions in this model:

The writing system is alphabetic in nature.
This can be easily verified by counting the number of distinct symbols in the found records.

The corpus has been transcribed into an electronic format.
This means that each character is uniquely identified.

About the morphology of the language: each word consists of a stem, a prefix, and a suffix, where the latter two may be omitted (see the segmentation sketch below).
This assumption holds for a large variety of human languages.

Problem Formulation
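
To make the morphological assumption concrete, here is a small sketch enumerating the (prefix, stem, suffix) splits a model under this assumption would consider for a word; the function name is ours, purely for illustration.

```python
def segmentations(word):
    """All (prefix, stem, suffix) splits of a word, where prefix and
    suffix may be empty but the stem may not -- the morphological
    assumption stated above."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n)                 # end of prefix
            for j in range(i + 1, n + 1)]     # end of (non-empty) stem

# e.g. "asked" admits ('', 'ask', 'ed') and ('', 'asked', '') among others
print(('', 'ask', 'ed') in segmentations("asked"))   # True
```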

Page 18: Lost Language Decipherment

The morpheme inventories and their frequencies in the known language are given.

In essence, the input consists of two parts:
A list of unanalyzed words in the lost language.
A morphologically analyzed lexicon in a known related language.

Problem Formulation

Page 19: Lost Language Decipherment

Consider the following example, consisting of words in a lost language closely related to English, but written using numerals:
15234 – asked
1525 – asks
4352 – desk

Notice the pair of endings, -34 and -5, attached to the same initial sequence 152-.
These might correspond to -ed and -s respectively.
Thus 3 = e, 4 = d, and 5 = s.

Intuition: A Toy Example
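
A small sketch of the ending-pairing heuristic just described: scan for word pairs sharing a long initial sequence and collect their differing endings as candidate suffix correspondences. The function name and the minimum-stem threshold are our own illustrative choices.

```python
def candidate_suffix_pairs(words, min_stem=3):
    """Find word pairs sharing an initial sequence of at least
    `min_stem` symbols; return (stem, ending_a, ending_b) triples."""
    triples = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            k = 0                      # length of common initial sequence
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            if k >= min_stem:
                triples.append((a[:k], a[k:], b[k:]))
    return triples

print(candidate_suffix_pairs(["15234", "1525", "4352"]))
# -> [('152', '34', '5')]: the shared stem 152- with endings -34 and -5,
#    which we hypothesize correspond to English -ed and -s
```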

Page 20: Lost Language Decipherment

Now we can say that 435 = des, and using our knowledge of English, we can suppose that this word is very likely to be “desk”.

As this example illustrates, we proceed by discovering both character- and morpheme-level mappings.

Another intuition the model should capture is the sparsity of the mapping:
A correct mapping will preserve the phonetic relations between the two related languages.
Each character in the lost language will therefore map to only a small number of characters in the related language.

Intuition: A Toy Example

Page 21: Lost Language Decipheration

We assume that each morpheme is probabilistically generated jointly with a latent counterpart in the lost language

The challenge: Each level of correspondence can completely describe the observed data. So using a mechanism based on one leaves no room for the other.

The solution: Using a Dirichlet Process to model probabilities (explained further).

Model Structure

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 22: Lost Language Decipherment

There are four basic layers in the generative process:
Structural sparsity
Character-edit distribution
Morpheme-pair distributions
Word generation

Model Structure (cont.)

Page 23: Lost Language Decipherment

Model Structure (cont.)

Graphical overview of the model:
http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 24: Lost Language Decipherment

We need control over the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse.

The set of edit operations includes character substitutions, insertions, and deletions. We assign a binary variable λe to every edit operation e.

The set of character correspondences whose variable is set to 1, { (u,h) : λ(u,h) = 1 }, conveys a set of phonetically valid correspondences.

We define a joint prior over these variables to encourage sparse character mappings.

Step 1: Structural Sparsity

Page 25: Lost Language Decipherment

This prior can be viewed as a distribution over binary matrices, and is defined to encourage every row and column to sum to a low integer value (typically 1).

For a given matrix, define a count c(u): the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑h λ(u,h).

We now define the features fi = max(0, |{u : c(u) = i}| − bi).
For any i other than 1, fi should be as low as possible.

The probability of the matrix is then given by
P(λ) = (1/Z) · exp( ∑i wi fi )

Step 1: Structural Sparsity (cont.)

Page 26: Lost Language Decipherment

Here Z is the normalization factor and w is the weight vector.

Each wi is either zero or negative, ensuring that the probability is high when the corresponding fi is low.

The values of bi and wi can be adjusted depending on the number of characters in the lost language and in the related language.

Step 1: Structural Sparsity (cont.)
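
A minimal sketch of evaluating this prior on a candidate mapping matrix. The b and w values below are illustrative assumptions, and the normalizer Z is omitted (only the unnormalized log-probability is computed).

```python
def sparsity_log_score(matrix, b, w):
    """Unnormalized log P(lambda) for a binary mapping matrix:
    c(u) = row sum = number of known letters mapped to lost letter u,
    f_i  = max(0, |{u : c(u) = i}| - b_i),
    log P = sum_i w_i * f_i  (up to the constant -log Z)."""
    counts = [sum(row) for row in matrix]          # c(u) for each row u
    score = 0.0
    for i, w_i in w.items():
        f_i = max(0, sum(1 for c in counts if c == i) - b.get(i, 0))
        score += w_i * f_i
    return score

# Toy 3x3 matrix: two 1-to-1 mappings, one letter mapping to two letters.
m = [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 1]]
b = {0: 0, 1: 3, 2: 1}            # budgets: tolerate one doubly-mapped letter
w = {0: -10.0, 1: 0.0, 2: -2.0}   # penalize unmapped / doubly-mapped letters
print(sparsity_log_score(m, b, w))   # -> 0.0, i.e. within budget
```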

Page 27: Lost Language Decipherment

We now draw a base distribution G0 over character-edit sequences.

The probability P(e) of a given edit sequence depends on the indicator variables λe of its individual edit operations, and on a factor q(#ins(e), #del(e)) that depends on the number of insertions and deletions in the sequence.

The insertion/deletion factor q is chosen according to the average word lengths of the lost language and the related language.

Step 2: Character-Edit Distribution

Page 28: Lost Language Decipheration

Example: Average Ugaritic word is 2 letters longer than an average Herbew word

Therefore, we set our q to be such as to disallow any deletions and allow 1 insertion per sequence, with the probability of 0.4

The part depending on the λes makes the distribution spike at 0 if the value is 0 and keeps it unconstrained otherwise (spike-and slab priors)

Step 2 :Character-Edit Distribution (cont.)
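
The following sketch scores a single edit sequence under the base distribution just described: the λ indicators zero out disallowed substitutions, and q penalizes insertions and deletions. The edit-sequence encoding and function names are our own assumptions.

```python
def edit_sequence_weight(edits, lam, q):
    """Unnormalized G0 weight of a character-edit sequence.
    `edits` holds tuples ('sub', u, h), ('ins', h) or ('del', u);
    `lam` maps (u, h) substitutions to their 0/1 indicator."""
    n_ins = sum(1 for op in edits if op[0] == 'ins')
    n_del = sum(1 for op in edits if op[0] == 'del')
    weight = q(n_ins, n_del)
    for op in edits:
        if op[0] == 'sub':
            weight *= lam.get((op[1], op[2]), 0)   # spike at 0 if disallowed
    return weight

def q(n_ins, n_del):
    """Mirrors the Ugaritic/Hebrew setting above: no deletions,
    at most one insertion per sequence, with probability 0.4."""
    if n_del > 0 or n_ins > 1:
        return 0.0
    return 0.4 if n_ins == 1 else 0.6

lam = {('u1', 'h1'): 1, ('u2', 'h2'): 1}
print(edit_sequence_weight([('sub', 'u1', 'h1'), ('ins', 'h3')], lam, q))  # 0.4
```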

Page 29: Lost Language Decipherment

The base distribution G0, together with a fixed concentration parameter α, defines a Dirichlet process, which provides a probability distribution over morpheme-pair distributions.

The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme pairs, while remaining sensitive to the character-level probabilities of the base distribution.

Our model distinguishes between the three kinds of morphemes – prefixes, stems, and suffixes – and therefore uses a different value of α for each.

Step 3: Morpheme-Pair Distributions

Page 30: Lost Language Decipheration

Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for the stem, we maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech.

Step 3 : Morpheme Pair-Distributions (cont.)
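
One standard way to sample from a Dirichlet process is the Chinese restaurant process, sketched below: frequently generated morpheme pairs are reused with probability proportional to their counts, while fresh pairs are drawn from G0 with probability proportional to α. This reproduces the skewed, "rich get richer" behavior described above; the CRP view is standard DP machinery, not a detail given in the slides.

```python
import random

def crp_draw(counts, alpha, sample_from_g0):
    """Draw a morpheme pair from a DP(alpha, G0) distribution via the
    Chinese restaurant process, updating `counts` in place."""
    total = sum(counts.values()) + alpha
    r = random.uniform(0, total)
    for pair, c in counts.items():
        r -= c
        if r <= 0:                       # reuse an existing pair
            counts[pair] += 1
            return pair
    pair = sample_from_g0()              # otherwise, a fresh pair from G0
    counts[pair] = counts.get(pair, 0) + 1
    return pair

# A frequent (hypothetical) Ugaritic/Hebrew stem pair is likely to be reused:
counts = {("mlk", "mlk"): 5}
print(crp_draw(counts, alpha=1.0, sample_from_g0=lambda: ("new", "new")))
```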

Page 31: Lost Language Decipherment

Once the morpheme-pair distributions have been drawn, actual word pairs may be generated.

Based on some prior, we first decide whether a word in the lost language has a cognate in the known language.

If it does, a cognate word pair (u, h) is produced; otherwise, a lone word u is generated.

Step 4: Word Generation
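
A generative sketch of this final step, under our reading of the slides: a coin flip decides whether the lost word has a cognate; if so, prefix/stem/suffix pairs are drawn (prefix and suffix conditioned on the stem's part of speech) and concatenated. All the sampler arguments are assumptions standing in for the distributions of the previous steps.

```python
import random

def generate_word(p_cognate, draw_stem, draw_prefix, draw_suffix, draw_lone):
    """Return (u, h): a lost-language word u and its known-language
    cognate h, or (u, None) when the word has no cognate."""
    if random.random() >= p_cognate:
        return (draw_lone(), None)          # lone lost-language word
    (u_stm, h_stm), pos = draw_stem()       # stem pair plus its POS tag
    u_pre, h_pre = draw_prefix(pos)         # prefix pair (possibly empty)
    u_suf, h_suf = draw_suffix(pos)         # suffix pair (possibly empty)
    return (u_pre + u_stm + u_suf, h_pre + h_stm + h_suf)
```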

Page 32: Lost Language Decipherment

This model captures both character- and lexical-level correspondences, while utilizing morphological knowledge of the known language.

An additional feature of this multi-layered structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0.

As a result, any character-level mapping learned from one correspondence is propagated to the other morpheme distributions.

The character-level mappings also obey the sparsity constraints.

Summarizing the Model

Page 33: Lost Language Decipherment

The model was applied to the Ugaritic language.
The undeciphered corpus contains 7,386 unique word types.
The Hebrew Bible was used as the known-language corpus; Hebrew is close to ancient Ugaritic.
Morphological and POS annotations are assumed to be available for the Hebrew lexicon.

Results of the Process

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 34: Lost Language Decipherment

The method identifies Hebrew cognates for 2,155 words, covering almost one third of the Ugaritic vocabulary.

The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and correctly translates only 29% of all cognates.

This method yields a correct mapping for 29 out of 30 characters, and correctly translates 60.4% of all cognates.

Results of the Process

Page 35: Lost Language Decipherment

Even with correct character mappings, many words can be translated correctly only by examining their context.

The model currently fails to take this contextual information into account.

Future Work

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Page 36: Lost Language Decipherment

We saw how language decipherment is an extremely complex task.

Years of effort are required for the successful decipherment of each lost language.

Success depends on the amount of corpus available in the unknown language; but availability alone does not make the task easy.

The statistical model has shown promise, and can be developed further and applied to more languages.

Conclusions

Page 38: Lost Language Decipherment

B. Snyder, R. Barzilay, and K. Knight – A Statistical Model for Lost Language Decipherment (ACL 2010). http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Straight Dope Science Advisory Board staff report – How come we can’t decipher the Indus script? (2005). http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Wade Davis on Endangered Cultures (2008). http://www.ted.com/talks/wade_davis_on_endangered_cultures.html

References