EACL 2003, Budapest: April 12 – 17, 2003
Computational Linguistics for South Asian Languages: Expanding Synergies with Europe

CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES

Dr. B. Mallikarjun
Central Institute of Indian Languages
Mysore 570 006, INDIA
[email protected]
www.ciil.org/faculty/mallikarjun.html
www.ciilcorpora.net


1. Current status of corpora – major Indian languages

2. Current status of corpora - minor Indian languages

3. Importance of minor languages corpora

4. Objectives

5. Categorization of minor languages for corpora building

6. Minor languages: A sample

7. Issues in corpora building

8. Corpus processing tools – a. Basic b. Advanced

9. Conclusion and a mission

EACL 2003, CLSAL: Budapest – April 12 – 17, 2003


India has 1652 mother tongues belonging to 4 language families. The Eighth Schedule of the Constitution of India recognizes 18 languages, spoken by 96.29% of the population.


Assamese : 2,622,836
Bengali : 3,535,863
Gujarati :
Hindi : 3,003,004
Kannada : 2,239,537
Kashmiri : 2,266,588
Konkani :
Malayalam : 2,349,526
Manipuri :
Marathi : 2,213,241
Nepali :
Oriya : 2,727,670
Punjabi : 1,966,260
Sanskrit :
Sindhi :
Tamil : 3,381,525
Telugu : 3,967,926
Urdu : 1,64,125


* The corpora differ in quantum.

* They are of comparable quality.

* Quantum and coverage are inadequate for wider NLP activities.

* The corpora need to be augmented with wider coverage.

* Attempts at enhancement face problems that need immediate solutions.


* The remaining 1634 are minor languages, spoken by 3.71% of the population.

* The Indo-Aryan and Dravidian language families have both major and minor languages.

* Almost all the languages of the other two families, Munda and Tibeto-Burman, are “minor” languages.

* Text corpora building has not taken place in these languages.


Minor languages hardly attract the attention of policy makers anywhere in the world.

They are endangered in Indian social, educational and linguistic contexts.

Linguists show great interest in studying the richness of languages and try to save endangered languages from extinction.


Minor languages rarely attract technological research or become a source for it.

Technology has made it possible to empower all languages, whether major or minor.

Creating corpora in minor languages, especially those that have little or no written literature, has certain critical advantages for linguistic computing.

Experimentation with corpus designs and standards is more easily done in these languages because of the manageable quantum of data.


Archival and cross-linguistic comparison within a language family and across language families. Utilize language technology for the preservation and continued use of these languages.

Fine-tune language analysis where a grammatical analysis is available. Where no such analysis is available, use the machine-readable form of the texts to produce as precise an analysis of the language as possible. Also use some of the minor language corpora for machine translation.

Speech corpora, too, have greater significance in minor languages, since most of them exist in spoken form and many are yet to be rendered into written form.

Indigenous knowledge systems: Most of the minor languages are resources of cultural heritage and a treasure house of indigenous knowledge systems. Once this knowledge is available in machine-readable form, it can be made available to the universal knowledge base by using UNL.


Minor languages can be classified into 3 groups on the basis of the issues to be tackled while building corpora.

First category: Languages other than the 18 major languages that have a good amount of literary and other texts and are also used in wider domains, like Bodo, Kurukh, Maithili, Santhali, Tripuri, etc.

Second category: Languages with a limited quantity of written texts that are not widely used in domains such as education and administration, like Kodava, Tulu, etc.

Third category: Languages available only in spoken form and yet to be rendered into written form, like Toda, Kota, Yerava, etc.
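The three-way grouping above can be sketched as a small decision rule. This is only an illustration: the field names `has_written_form` and `wide_domain_use` and the function `categorize` are assumptions made for the example, not part of any CIIL tool, and the real criteria are matters of degree rather than yes/no flags.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    FIRST = 1   # good amount of written text, used in wider domains
    SECOND = 2  # limited written text, not widely used across domains
    THIRD = 3   # spoken form only, not yet rendered into writing


@dataclass
class LanguageProfile:
    name: str
    has_written_form: bool  # hypothetical field: any written texts exist
    wide_domain_use: bool   # hypothetical field: used in education, administration, etc.


def categorize(profile: LanguageProfile) -> Category:
    """Assign a language to one of the three corpus-building categories."""
    if not profile.has_written_form:
        return Category.THIRD
    if profile.wide_domain_use:
        return Category.FIRST
    return Category.SECOND


# Examples taken from the categories named in the text
print(categorize(LanguageProfile("Maithili", True, True)))
print(categorize(LanguageProfile("Kodava", True, False)))
print(categorize(LanguageProfile("Toda", False, False)))
```

The category decides the corpus source: all available text for the first two groups, and transcribed speech for the third.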


Name | Lg. family | Text | Script | No. of speakers
Maithili | Indo Aryan | Yes | Devanagari | 77,66,597
Kodava or Coorgi | Dravidian | Very little | Kannada | 97,011
Yerava | Dravidian | Indigenous Knowledge System | No script | 13,689

These languages are representative of the ground linguistic reality in India.


Issue | Major language | Minor language
Text | Sampling: domains, period | All available text / all transcribed speech: Maithili, Kodava and Yerava
Technical: keyboard, input and storage | Standard software based on the grammar of the concerned script and UNICODE for Kannada: - 1, 2, 3, 4. | Incompatibility: the adopted software does not accommodate all the features of Maithili, Kodava and Yerava
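The input and storage issue can be made concrete with a short check on Unicode text. The sketch below assumes text keyed in Kannada script and stored as UTF-8. The helper names are invented for the example. The only fixed fact used is that the Unicode Kannada block spans U+0C80 to U+0CFF.

```python
# Unicode Kannada block: U+0C80 .. U+0CFF
KANNADA_BLOCK = range(0x0C80, 0x0D00)


def outside_kannada_block(text):
    """Return the non-whitespace characters that fall outside the Kannada block."""
    return [ch for ch in text if not ch.isspace() and ord(ch) not in KANNADA_BLOCK]


def survives_utf8_storage(text):
    """Check that the text is unchanged after being encoded to UTF-8 and decoded."""
    return text.encode("utf-8").decode("utf-8") == text


# The word "Kannada" written in Kannada script, given as explicit code points
sample = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
print(outside_kannada_block(sample))         # no stray characters expected
print(outside_kannada_block(sample + " x"))  # the Latin letter is reported
print(survives_utf8_storage(sample))
```

A check of this kind only flags characters that the script block cannot represent. It says nothing about sounds of Kodava or Yerava that have no conventional spelling in the adopted script.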


Frequency count of words and syllables:

The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications are made before they are used.
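A frequency count of this kind can be sketched as follows. This is not the CIIL facility itself. It assumes whitespace-separated words with punctuation already removed, and it approximates a syllable by the orthographic unit of Indic scripts: a letter plus the signs attached to it, with consonants joined across a virama. The function names are chosen for the example.

```python
import unicodedata
from collections import Counter

VIRAMA_CLASS = 9  # canonical combining class that Unicode assigns to virama signs


def syllables(word):
    """Split a word into approximate orthographic syllables (aksharas)."""
    units = []
    prev = ""
    for ch in word:
        is_letter = unicodedata.category(ch).startswith("L")
        after_virama = prev != "" and unicodedata.combining(prev) == VIRAMA_CLASS
        if units and (not is_letter or after_virama):
            units[-1] += ch   # vowel sign, virama, or second consonant of a conjunct
        else:
            units.append(ch)  # a new syllable starts here
        prev = ch
    return units


def frequency_counts(text):
    """Return (word frequencies, syllable frequencies) for whitespace-separated text."""
    words = text.split()
    word_freq = Counter(words)
    syllable_freq = Counter(s for w in words for s in syllables(w))
    return word_freq, syllable_freq


# "Kannada" in Kannada script: ka + n-virama-na + da
word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
print(syllables(word))
word_freq, syllable_freq = frequency_counts(word + " " + word)
print(word_freq.most_common(1), syllable_freq.most_common(1))
```

Because the segmentation relies on Unicode character properties rather than one script's code chart, the same sketch applies to Devanagari text. Language-specific modifications would replace `syllables`.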


Comparison of Maithili, Kodava and Yerava Corpora

Statistical distribution | Maithili | Kodava | Yerava
Corpus size | 328146 | 9432 | 3881
Word types | 51902 | 6050 | 3030
Most frequent syllable | ka | ra | ru
Average word length % | 3.52 | 5.70 | 3.10


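Statistical distribution | Maithili | Hindi (CIIL) | Hindi (Naiduniya) | Hindi (India Today) | Hindi (Premchand)
Corpus size | 328146 | (Hindi figures run together in the source, Premchand first: 671171156677931407292327129)
Word types | 51902 | (Hindi figures run together in the source, Premchand first: 247454764071953189860)
Most frequent syllable | ka | ka | ka | ka | ka
Average word length % | 3.52 | 4.96 | 4.96 | 4.71 | 4.36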


Statistical distribution | Kodagu | Yerava | Kannada | Malayalam
Corpus size | 9432 | 3881 | 1977987 | 2119935
Word types | 6050 | 3030 | 346850 | 526802
Average word length % | 5.70 | 3.10 | 8.68 | 10.25
Average sentence length % | 4.64 | 4.36 | 8.42 | 6.93
Most frequent syllable | ra | ru | ru | ra
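The rows of these comparison tables can be derived from raw text along the following lines. This is a sketch under stated assumptions, since the slides do not define the units. Here corpus size is the number of word tokens, word types is the number of distinct tokens, word length is counted in Unicode code points, and sentence length is counted in words. Sentences are assumed to end at ".", "!", "?" or the danda (U+0964).

```python
import re

SENTENCE_END = r"[.!?\u0964]+"  # includes the danda used in Devanagari text
WORD = r"[^\s.!?\u0964]+"


def corpus_statistics(text):
    """Compute corpus size, word types, average word and sentence length."""
    sentences = [s for s in re.split(SENTENCE_END, text) if s.strip()]
    words = re.findall(WORD, text)
    if not words:
        return {"corpus_size": 0, "word_types": 0,
                "avg_word_length": 0.0, "avg_sentence_length": 0.0}
    return {
        "corpus_size": len(words),                                    # tokens
        "word_types": len(set(words)),                                # distinct tokens
        "avg_word_length": sum(len(w) for w in words) / len(words),   # code points per word
        "avg_sentence_length": len(words) / len(sentences),           # words per sentence
    }


print(corpus_statistics("a bb a. ccc bb!"))
```

Counting code points inflates word length for scripts that write vowel signs as separate characters. Counting syllables instead would be an equally defensible reading of the "Average word length" row.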


1. Key Word in Context

2. Search by required word

3. Sorting and indexing

The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications can be made before they are used.
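The three basic facilities listed above can be sketched in a few lines. The functions `kwic` and `sorted_index` are illustrative names, not the interface of the existing Hindi or Kannada tools, and the English tokens stand in for corpus text.

```python
def kwic(tokens, keyword, window=3):
    """Key Word in Context: return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def sorted_index(tokens):
    """Build a word index (word -> list of positions) with the words in sorted order."""
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append(pos)
    return dict(sorted(index.items()))


tokens = "the cat sat on the mat".split()
for left, key, right in kwic(tokens, "the", window=2):
    print(f"{left:>10} [{key}] {right}")
print(sorted_index(tokens))
```

The sort here follows Unicode code point order. A language-specific modification would typically supply a collation key that matches the alphabetical order of the script in question.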


1. Part-of-speech tagging

2. Morphological analyzer


1. The non-availability of a standard basic tag set is one of the major drawbacks.

2. Each institution or group of scholars uses its own notation: CLAWS, Research institution in IT, CIIL (Maj lg.), CIIL (Min lg.)

3. The tagging tools being developed even for major languages are at different stages of development.

4. The POS tagging tool developed for Hindi can be tried out on Maithili in the first instance to test its viability. However, Hindi itself does not yet have a fully working POS tagging tool.

5. Due to the limited data in Kodava and Yerava, manual tagging is preferred.
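One way to work around differing notations is to map each group's tags onto a shared basic tag set. The sketch below is purely illustrative. The tag names, the scheme labels `group_a` and `group_b`, and the common set are all hypothetical and do not reproduce CLAWS or any CIIL tag set.

```python
# Hypothetical common basic tag set
COMMON_TAGS = {"NOUN", "VERB", "ADJ", "OTHER"}

# Hypothetical notations used by two annotation groups
TAG_MAPS = {
    "group_a": {"NN": "NOUN", "VB": "VERB", "JJ": "ADJ"},
    "group_b": {"N": "NOUN", "V": "VERB", "A": "ADJ"},
}


def to_common(tagged_words, scheme):
    """Convert (word, tag) pairs from one group's notation to the common tag set."""
    mapping = TAG_MAPS[scheme]
    return [(word, mapping.get(tag, "OTHER")) for word, tag in tagged_words]


# The same manually tagged sentence, recorded in two notations
a = to_common([("w1", "NN"), ("w2", "VB")], "group_a")
b = to_common([("w1", "N"), ("w2", "V")], "group_b")
print(a == b)
```

For small corpora such as Kodava and Yerava, where tagging is manual, a mapping table of this kind lets annotations made under one notation be compared with those made under another.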


The morphological analyzers designed for the minor languages of India should be sensitive enough to take care of their specific features.

1. Tagged lexicon

2. Rules to cover the processes of:

Inflection: suffixing is normally based on the word ending.

Derivation: both prefixing and suffixing are possible, depending on the lexical item.
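The two components named above, a tagged lexicon and rules for inflection and derivation, can be sketched with a toy analyzer. Every stem, affix and feature label below is invented for illustration and belongs to no real language. The point is only that the inflection rule is selected by the stem's ending, while derivation is listed per lexical item.

```python
VOWELS = "aeiou"

# Hypothetical tagged lexicon: stem -> part of speech
LEXICON = {"tako": "NOUN", "belin": "NOUN", "suri": "VERB"}

# Inflection: (suffix, required stem ending, feature)
INFLECTION_RULES = [
    ("na", "vowel", "PLURAL"),
    ("ana", "consonant", "PLURAL"),
]

# Derivation: listed per lexical item as (prefix, suffix) -> feature
DERIVATION_RULES = {
    "suri": {("a", ""): "NEGATIVE", ("", "ka"): "AGENT"},
}


def ending_type(stem):
    return "vowel" if stem[-1] in VOWELS else "consonant"


def analyze(word):
    """Return all (stem, tag, features) analyses licensed by the lexicon and rules."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], []))
    for suffix, required_ending, feature in INFLECTION_RULES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in LEXICON and ending_type(stem) == required_ending:
            analyses.append((stem, LEXICON[stem], [feature]))
    for stem, affixes in DERIVATION_RULES.items():
        for (prefix, suffix), feature in affixes.items():
            if word == prefix + stem + suffix:
                analyses.append((stem, LEXICON[stem], [feature]))
    return analyses


print(analyze("takona"))    # vowel-final stem takes the "na" suffix
print(analyze("belinana"))  # consonant-final stem takes the "ana" suffix
print(analyze("asuri"))     # prefix licensed only for this lexical item
```

A word that returns more than one analysis cannot be resolved by morphology alone and needs contextual or semantic information.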


The Yerava word ‘-ati’ has three meanings, ‘to sweep’, ‘wind blow’ and ‘bottom’; the appropriate meaning has to be chosen depending on the context. In such cases the morphological analyzer demands a semantic tool.

The Kodava word bappe means ‘I am coming’, but when it is used in the context of leave taking, it means ‘I am leaving.’ Cultural nuances in the context of leave taking do not allow one to use the word poope ‘going or leaving’, because it would only mean that the person is saying the ultimate good-bye to this world. The meaning of such words can be judged only with knowledge of the culture that the language represents.
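A semantic tool of the kind called for here could, in its simplest form, pick a sense by counting overlaps between the surrounding words and cue words recorded for each sense. The sketch below uses the three senses of the Yerava word given above. The cue words are hypothetical English glosses invented for the example, not Yerava data, and `choose_sense` is an illustrative name.

```python
# Senses are from the text; the cue words for each sense are hypothetical glosses
SENSES = {
    "ati": {
        "to sweep": {"broom", "floor", "house"},
        "wind blow": {"wind", "rain", "tree"},
        "bottom": {"pot", "under", "below"},
    }
}


def choose_sense(word, context_words):
    """Pick the sense whose cue words overlap most with the context, or None if none do."""
    context = set(context_words)
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(choose_sense("ati", ["wind", "tree"]))
print(choose_sense("ati", ["river"]))
```

Cue overlap cannot capture the culturally conditioned reading of a word like bappe. That case needs information about the speech situation, which a word list does not encode.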


Ambiguities are seen in three senses: word sense, pronoun sense and structural sense. Word sense ambiguities arise from words having multiple meanings and are found in all languages. With regard to the second, pronominal and adjectival anaphora are also sources of ambiguity. For English, disambiguation tools have been developed. Since the inception of lexical databases such as WordNet and EuroWordNet, researchers seem to have overcome the ambiguity problem to a certain extent.

In the case of Indian languages, however, in the absence of such a sensitive tool, one has to disambiguate manually, even in the case of major languages.

Minor languages need better linguistic analysis to arrive at tangible and usable disambiguation procedures.


India abounds in endangered languages. Technology can actually help maintain a language.

Technology should immediately take into account the concerns of minority languages. In particular, the major language technologies of the region should accommodate the needs of the minor languages too.

Corpora building in minor languages poses new challenges and calls for novel ways to accommodate and adequately describe the distinctive features of these languages.

Comparison of corpora studies within a family of languages, across families of languages and at the international level will be helpful in bringing out a standard module for developing corpora.


Thank You


8.1 Kannada Code Chart


Demography; Astrology; Criminology; Physical Education / Sports; Health and Family Welfare; Forestry; Sexology; Culture & Anthropology

Commerce: Banking; Accountancy; Industry & Handicrafts; Finance; Textile Technology

Official And Media Languages: Mass Media; Legislative; Administrative

Translated Material: Literature; Scientific; Legal; Administration; Translated Psychology

Aesthetics

Literature: Novel; Short Story; Essays; Criticism; Humour; Children's Literature; Biographies & Autobiographies; Travelogues; Letters/Diaries/Speeches; Plays; Science Fiction; Folk Tales; Text Books (School) Social Sciences

Fine Arts: Music; Dance/Impersonations; Drawing; Sculpture; Musical Instruments; Hobbies

Natural, Physical And Professional Sciences: Botany; Zoology; Geology; Geography; Bio Chemistry; Micro Biology; Physics; Chemistry; Mathematics; Statistics; Computer Sciences; Astronomy; Text Book (Science); Medicine; Ayurveda; Homeopathy; Yoga; Naturopathy; Engineering; Architecture; Oceanology; Agriculture; Veterinary; Film Technology; Photography; Marine Biology; Fisheries; Textile Technology

Social Sciences: Sociology; Linguistics; Psychology; Anthropology; History, Archeology, Epigraphy; Political Science; Home Science; Library Science; Religion, Philosophy; Economics; Logic; Journalism; Folklore/Mythology; Public Administration; Law; Business Management; Education; Text Books-Social Science
