TelMore

Telugu Morphological Generator

Madhavi Ganapathiraju and Lori Levin

Language Technologies Institute
Carnegie Mellon University
Pittsburgh USA

ICUDL 2006: Second International Conference on Universal Digital Library
Alexandria, Egypt
November 17-19, 2006

19th Nov, 2006 ICUDL2006: TelMore - Telugu Morphological Generator 2

UDL

• machine translation
• summarization
• information retrieval
• interface design
• digital storage
• OCR

Machine translation

Rani gave the book to my mother

1. Output from English lexical analysis
– gave: verb, past, root give
– the book: noun phrase, singular, neutral
– mother: noun, singular, feminine
– my: possessive, root I

2. English – Telugu dictionary for root forms of nouns and verbs
– give: ichchut’a
– book: pustakamu
– mother: talli, amma
– I: neinu

3. TelMore: morphological generator for Telugu
– ichchut’a: ichchaad’u (past masc), ichchinadi (past fem), ... istun’di (future fem), istaad’u (future masc)
– pustakamu: pustakamu, pustakamutoo (with pustakamu), pustakamu loo (in pustakamu) …
– amma: ammaki (to amma), amma cheita (by amma)
– I: naa (possessive)

1. Phrase match in EBMT
– Gave to <noun>: <noun> ki ichchaad’u

OR
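The three steps on this slide can be pictured as two table lookups chained together. The following is only a minimal sketch using the words shown on the slide; the table layout, the feature labels ("to", "with", "possessive", ...) and the function name `generate` are assumptions made for illustration and do not describe TelMore's actual interface.

```python
# Toy walk-through of the slide's pipeline, using only words from the slide.
# Table layout, feature labels and function names are illustrative assumptions.

# Step 2: English root -> Telugu root. For "mother" the slide lists
# "talli, amma"; "amma" is used here because step 3 shows forms for it.
EN_TE_ROOTS = {
    "give": "ichchut’a",
    "book": "pustakamu",
    "mother": "amma",
    "I": "neinu",
}

# Step 3: (Telugu root, feature label) -> surface form, as listed on the slide.
TELUGU_FORMS = {
    ("ichchut’a", "past masc"): "ichchaad’u",
    ("ichchut’a", "past fem"): "ichchinadi",
    ("ichchut’a", "future masc"): "istaad’u",
    ("pustakamu", "with"): "pustakamutoo",
    ("pustakamu", "in"): "pustakamu loo",
    ("amma", "to"): "ammaki",
    ("amma", "by"): "amma cheita",
    ("neinu", "possessive"): "naa",
}


def generate(english_root, feature):
    """Look up the Telugu root, then the requested morphological form."""
    telugu_root = EN_TE_ROOTS[english_root]
    return TELUGU_FORMS[(telugu_root, feature)]


if __name__ == "__main__":
    # "to my mother": possessive of "I" followed by the "to" form of "mother"
    print(generate("I", "possessive"), generate("mother", "to"))
```

A real generator would compute the forms from rules rather than store them, which is the role TelMore plays in step 3.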

TelMore

Generates morphological forms for nouns and verbs when the root word is given

About Telugu

• 2nd largest spoken language in India (?)
• 70 M native speakers
• World ranking 13-17
– with Korean, Vietnamese, Marathi and Tamil
• 7th century AD recorded origin
• literary language in 11th century AD

Parts of Speech: Noun

• Number: singular, plural
• Gender: male, female, neutral
• Morphological forms (vibhaktulu):
– nominative, genitive, dative, accusative, vocative, instrumental and locative

14 forms for each noun
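The figure of 14 is consistent with crossing the two numbers with the seven cases. The sketch below enumerates those slots under the assumption that gender is a fixed property of each noun rather than a dimension it inflects for; the names `NUMBERS`, `CASES` and `noun_slots` are illustrative only.

```python
from itertools import product

# Inflectional dimensions named on the slide.
NUMBERS = ["singular", "plural"]
CASES = [
    "nominative", "genitive", "dative", "accusative",
    "vocative", "instrumental", "locative",
]


def noun_slots():
    """Return every (number, case) combination a noun paradigm must fill."""
    return list(product(NUMBERS, CASES))


if __name__ == "__main__":
    slots = noun_slots()
    print(len(slots))  # 2 numbers x 7 cases
    for number, case in slots:
        print(number, case)
```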

Plural formation

The general rule is to add “lu” as a suffix.

A series of rules is then applied to yield the final form of the suffix:

లు (lu), ల్లు (llu), ళ్ళు (l’l’u) or ండ్లు (n’d’lu)
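One way to picture "add lu, then apply rules" is an ordered list of ending rewrites applied to the naive form. The slides do not list the actual rules, so the single rewrite below (the ending "amulu" becoming "aalu") is an assumed example included only to show the mechanism; the names `PLURAL_REWRITES` and `pluralize` are hypothetical.

```python
# Ordered rewrite rules applied to the naive "root + lu" form.
# The rule below is an illustrative assumption, not taken from the slides.
PLURAL_REWRITES = [
    ("amulu", "aalu"),  # assumed example: pustakamu + lu -> pustakaalu
]


def pluralize(root):
    """Append the general suffix, then apply the first matching rewrite."""
    naive = root + "lu"  # general rule stated on the slide
    for ending, replacement in PLURAL_REWRITES:
        if naive.endswith(ending):
            return naive[: -len(ending)] + replacement
    return naive  # no special rule matched; keep the plain "lu" plural


if __name__ == "__main__":
    for word in ["amma", "pustakamu"]:
        print(word, "->", pluralize(word))
```

Keeping the rules in an ordered table means more specific endings can be listed ahead of general ones without touching the function.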

Parts of Speech: Verb

• Number: singular, plural
• Gender: male, female, neutral
• Person: 1st person, 2nd person, 3rd person
• Morphological forms:
– Present, past, future, aorist affirmative, aorist negative, imperative and prohibitive
– Present participle, past participle: affirmative and negative
• Number of forms: 2 x 3 x 3 x 7 + 4

130 forms for each verb
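The count on this slide can be checked directly: seven finite forms across every number, gender and person combination, plus four participles that do not vary along those dimensions. This is a small arithmetic sketch; the list and function names are assumptions for illustration.

```python
# Dimensions and form lists as named on the slide.
NUMBERS = ["singular", "plural"]
GENDERS = ["male", "female", "neutral"]
PERSONS = ["1st", "2nd", "3rd"]
FINITE_FORMS = [
    "present", "past", "future", "aorist affirmative",
    "aorist negative", "imperative", "prohibitive",
]
PARTICIPLES = [
    "present participle affirmative", "present participle negative",
    "past participle affirmative", "past participle negative",
]


def verb_form_count():
    """2 x 3 x 3 x 7 inflected finite forms, plus 4 participles."""
    finite = len(NUMBERS) * len(GENDERS) * len(PERSONS) * len(FINITE_FORMS)
    return finite + len(PARTICIPLES)


if __name__ == "__main__":
    print(verb_form_count())
```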

Features in TelMore (v.1)

• Morphological form generation
– Nouns
– Verbs
• System
– Library module for integration elsewhere
– Flat file input & output (plain text or html)
– User-interactive through command line
– Web interface for data addition with user validation
• Web Interface

Current Data Size

words have been created by native speakers upon request

Nouns    Number of Unique Stems    Number of Morph Forms
1        247                       3458
2        541                       7574
3        28                        392
3.2      14                        196
3.3      8                         112
3.4      18                        252
3.5      6                         84

Verbs
1        55                        4180

Total    917                       16248
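The figures above can be cross-checked against the totals and against the 14 forms per noun stated earlier. The sketch below assumes each row reads as class, unique stems, morph forms; the dictionary and function names are illustrative.

```python
# Rows read as: class label -> (unique stems, morph forms).
NOUN_ROWS = {
    "1": (247, 3458),
    "2": (541, 7574),
    "3": (28, 392),
    "3.2": (14, 196),
    "3.3": (8, 112),
    "3.4": (18, 252),
    "3.5": (6, 84),
}
VERB_ROWS = {"1": (55, 4180)}


def totals():
    """Sum stems and forms over all noun and verb rows."""
    rows = list(NOUN_ROWS.values()) + list(VERB_ROWS.values())
    return sum(stems for stems, _ in rows), sum(forms for _, forms in rows)


def nouns_have_14_forms():
    """Check that every noun row has exactly 14 forms per stem."""
    return all(forms == 14 * stems for stems, forms in NOUN_ROWS.values())


if __name__ == "__main__":
    print(totals())
    print(nouns_have_14_forms())
```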

Linguistic Knowledge

• The linguistic rules are taken from a book by C.P. Brown
– Rules are demonstrated through examples
– No formal description

Noun: First Declension Morphs

Noun: Second Declension

Noun: Third Declension

Noun: Third Declension: Irregular 2

Noun: Third Declension: Irregular 3

Noun: Third Declension: Irregular 4

Noun: Third Declension: Irregular 5

Verb: First Conjugation

Verb: Second Conjugation

Verb: Third Conjugation

Alternate dialects and spellings

Telugu is spoken in many dialects
– Andhra Pradesh has long borders with 4 states, each of which speaks a different language, and one long coastal region
– The dialects in each of these regions are different
– The learned and the others speak different dialects
– Urdu influence in Hyderabad due to Muslim rule
– pure/poetic, formal/informal

Telugu is written the way it is spoken

Hence the different dialects result in different spellings of the words

Future work for this tool

• Causative, middle and passive voices to be added

• Morphology of adjectives, etc.

• Om native font integration for flat file processing

• Integration with English Lexicon to be of real use in multilingual applications

Acknowledgements

• Linguistics Advisor: Prof. Lori Levin
• UDL Advisors: Prof. Raj Reddy, Prof. N. Balakrishnan
• Web-interface creation: R. Harsha, Naveena Yanamala
• Data Creation …: V. Mythili Shyam, G. Padmasree, V. Abhinay, B.V. Prashanth, G. Ramana Lakshmi, G. Padmavathy, V. Nava Mallika

http://linzer.blm.cs.cmu.edu/morph/
www.cs.cmu.edu/~madhavi