Language technology for morphologically rich languages

Trond Trosterud
Giellatekno, Centre for Saami Language Technology
http://giellatekno.uit.no/
September 5, 2017

Slides: mlp.computing.dcu.ie/mlp2017/docs/trosterud.pdf

Contents

A very subjective history of language technology

A model for all the other languages

Conclusion

A very subjective history of language technology

▶ The computers came with the Cold War
  Our task was to build MT from Russian to English
▶ First attempt (ask the cryptographers):
  ▶ Machine translation seen as a noisy channel?
▶ Second attempt (ask the linguists):
  ▶ Generative grammar promised to ... generate grammatical sentences
▶ 1966: The ALPAC report
  ▶ We (the linguists) had failed

Some of the critique is still valid

▶ Bar-Hillel 1960:
  ▶ Little John was looking for his toy box. Finally he found it. The box was in the pen.
▶ Google Translate 2017:
  ▶ Lille John var på utkikk etter sin leketøyboks. Til slutt fant han det. Boksen var i pennen.
    (Norwegian "pennen" is the writing instrument; the playpen reading that Bar-Hillel intended is lost)

The post-ALPAC world of formal linguistics 1

▶ Not that much MT for a long while, but:
▶ Formal linguistics
  ▶ Until 1980: Chomskyan generative grammar
  ▶ After 1980: Chomsky went for “Universal Grammar”
    (= left the field of grammar modelling)
  ▶ Alternative generative models (LFG, HPSG)
    ▶ did not result in robust parsers

The post-ALPAC world of formal linguistics 2

▶ An alternative approach to morphophonology
  ▶ C. Douglas Johnson 1972: Formal Aspects of Phonological Description
    rewrite rules ( A → B | C _ D ) as finite-state transducers
  ▶ Kimmo Koskenniemi 1983: Rewrite rules as parallel relations
  ▶ Around 1990: Xerox builds efficient compilers
▶ The word form problem was solved
  (we will return to the relevance this has for tomorrow’s shared task)
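
To make the idea of a rewrite rule as a finite-state transducer concrete, here is a minimal Python sketch. The rule n → m / _ p, the state numbering and the function name `transduce` are illustrative assumptions, not taken from Johnson, Koskenniemi or the Xerox tools. The transducer is written out by hand here, whereas a rule compiler would derive it from the rule.

```python
# Hand-built nondeterministic transducer for the illustrative rule
#   n -> m / _ p   ("n is realised as m before p").
# ARCS[state][input_symbol] is a list of (output_symbol, next_state) pairs;
# an output of None copies the input symbol unchanged.
OTHER = "<other>"  # wildcard for any symbol without an explicit arc

ARCS = {
    0: {"n": [("m", 1), ("n", 2)], OTHER: [(None, 0)]},
    1: {"p": [("p", 0)]},                    # after n:m the next symbol must be p
    2: {"n": [("m", 1), ("n", 2)], "p": [],  # after n:n the next symbol must not be p
        OTHER: [(None, 0)]},
}
FINAL = {0, 2}  # state 1 is not final: the promised p has not been seen yet


def transduce(word):
    """Return all surface strings the transducer relates to `word`."""
    results = []

    def walk(pos, state, out):
        if pos == len(word):
            if state in FINAL:
                results.append(out)
            return
        sym = word[pos]
        arcs = ARCS[state]
        # Use the explicit arc for this symbol if there is one, else the wildcard.
        for out_sym, nxt in arcs.get(sym, arcs.get(OTHER, [])):
            walk(pos + 1, nxt, out + (sym if out_sym is None else out_sym))

    walk(0, 0, "")
    return sorted(results)


if __name__ == "__main__":
    for w in ("input", "into", "nnp"):
        print(w, "->", transduce(w))
```

The nondeterminism is the point: on reading n the machine guesses whether a p follows, and wrong guesses die in a state with no matching arc, so only outputs consistent with the rule survive.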

In came the nineties

▶ Finally, the linguists had broken the code:
  we came up with a technology combining robustness and depth
  1. Finite-state transducers had solved analysis / generation
  2. Constraint Grammar solved the homonymy problem
▶ Disambiguating ambiguity in context:
  John tries to walk the walk
  ==> context-sensitive disambiguation rules
  (Fred Karlsson, Pasi Tapanainen, Eckhard Bick)
▶ Our moment in the limelight:
  The British National Corpus was annotated by
  finite-state transducers and Constraint Grammar
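
The "walk the walk" example can be sketched in a few lines of Python. This is a toy in the spirit of Constraint Grammar, not the actual CG formalism or any existing grammar. The tag names (N, V, Det, InfMark), the two rules and the helper names `remove` and `disambiguate` are all assumptions made for illustration.

```python
def remove(target, left_tag):
    """Build a rule that removes reading `target` when the word to the left
    is unambiguously `left_tag`. The last remaining reading is never removed."""
    def rule(cohorts, i):
        if i == 0:
            return False
        readings = cohorts[i][1]
        left = cohorts[i - 1][1]
        if left == {left_tag} and target in readings and len(readings) > 1:
            readings.discard(target)
            return True
        return False
    return rule


# Two hypothetical context-sensitive rules:
RULES = [
    remove("N", "InfMark"),  # no noun reading right after the infinitive marker
    remove("V", "Det"),      # no verb reading right after a determiner
]


def disambiguate(cohorts, rules=RULES):
    """Apply the rules until nothing changes; the input is left untouched."""
    cohorts = [(word, set(readings)) for word, readings in cohorts]
    changed = True
    while changed:
        changed = False
        for i in range(len(cohorts)):
            for rule in rules:
                changed = rule(cohorts, i) or changed
    return cohorts


# Output of a (hypothetical) morphological analyser: word form + possible readings.
SENTENCE = [
    ("John", {"N"}), ("tries", {"V"}), ("to", {"InfMark"}),
    ("walk", {"N", "V"}), ("the", {"Det"}), ("walk", {"N", "V"}),
]

if __name__ == "__main__":
    for word, readings in disambiguate(SENTENCE):
        print(word, sorted(readings))
```

The analyser supplies every possible reading, and the rules only discard readings that the context rules out. Because a rule never removes the last reading, each word keeps at least one analysis.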

Then two things happened:

1. The inventors of these techniques commercialised them
   and lifted them out of the common development
   (thus there were no open compilers or grammars,
   but grammar checkers for MS Word, annotating gold corpora
   for statistical models)
2. Computers got faster and the algorithms better

Statistical methods won the day

▶ “Every time I fire a linguist my system improves”
▶ Morphology is handled by lists
▶ Different types of processing are handled via machine learning
▶ Performance went down, but algorithms were open;
  good data were closed!

A side note on language typology

▶ We know this quote: “Take a language like, say, ...”
▶ But languages are not like English

There is a growing interest in extending the scope of language technology

▶ (cf. this workshop)
▶ A natural choice (?): extend the model we had for English to these other languages
▶ So far, not too many success stories on this front
  (there are taggers, but not that many end-user applications)
  ▶ No spellchecker for any North American languages
  ▶ Very few languages have grammar checkers
  ▶ Far worse MT into Finnish than into other EU languages
  ▶ Bad MT between, say, Swedish and Norwegian
▶ In short, a paucity of working solutions for the majority of languages

Meanwhile, in the grammatical camp:

▶ We have been extending the domain of the rules from the nineties
▶ Adding:
  ▶ grammatical functions
  ▶ semantic roles
  ▶ dependency relations
▶ into an analysis that is both robust and deep
  (dependency annotation at > 95%)
▶ and we have got open compilers
▶ ... but our time in the limelight is over

A model for all the other languages

So, the limelight is gone, but here I am, on another scene

▶ As witnessed by the growing concern and a growing number of workshops:
  The morphologically rich languages are not that easy
▶ Identifying the morphemes is not enough
▶ Perhaps we should have a second look at what happened in the nineties

My answer: A viable model for “all the other languages”

▶ Each language needs a team
  ▶ Programmer (shared)
  ▶ Computational linguist (shared)
  ▶ Linguist
  ▶ ... and eventually a native speaker (and preferably linguist)
▶ Here is the thing:
  For every language, there is a linguist
  who has devoted his or her life to it.
  Language technology has something to offer:
  ==> a test bed for his or her grammatical model
▶ Each team would share the common infrastructure
  The Linux model, as it were

But can we repeat the Linux model for language technology?

▶ It turns out we can
▶ cf. two examples
  ▶ http://giellatekno.uit.no/doc/lang/
  ▶ http://wiki.apertium.org/

Language technology in practice

1. Common, scaleable infrastructure
2. Language models
3. A pipeline for making practical applications

Common, scaleable infrastructure

Figure: A schematic overview of the Giella infrastructure

Figure: Circumpolar languages in the Giella infrastructure

The language models

Figure: A composed finite-state transducer gives the accusative of North Saami gussa, ‘cow’
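
Since the figure itself is not reproduced here, the following Python sketch illustrates what composing two transducers means for this example. It models each transducer as a function from a string to a set of strings, which is a simplification rather than a state-level implementation. The tag strings, the trigger symbol `^WG` and the single gradation rule (ss becomes s) are assumptions chosen for illustration; the real Giella sources use their own lexicon and rule formalisms.

```python
import re

WEAK = "^WG"  # hypothetical trigger symbol asking for the weak grade

# Toy lexicon transducer: analysis string -> intermediate form(s).
STEMS = {"gussa+N": "gussa"}
ENDINGS = {"+Sg+Nom": "", "+Sg+Acc": WEAK}


def lexicon(analysis):
    return {stem + ending
            for lemma, stem in STEMS.items()
            for tags, ending in ENDINGS.items()
            if analysis == lemma + tags}


def gradation(form):
    """Toy morphophonological rule: ss -> s before the final vowel when the
    weak-grade trigger is present; the trigger itself is deleted."""
    if not form.endswith(WEAK):
        return {form}
    stem = form[:-len(WEAK)]
    return {re.sub(r"ss(?=[aeiou]$)", "s", stem)}


def compose(first, second):
    """Relation composition: feed every output of `first` into `second`."""
    def composed(x):
        return {z for y in first(x) for z in second(y)}
    return composed


generate = compose(lexicon, gradation)

if __name__ == "__main__":
    print(generate("gussa+N+Sg+Nom"))  # strong grade kept
    print(generate("gussa+N+Sg+Acc"))  # weak grade
```

Composition lets the lexicon and the sound alternations be written separately, with the intermediate form never visible to the user. Running the composed relation in the other direction would give analysis instead of generation.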

Figure: Automaton and transducer

Figure: Combining writing systems

The interface to applications

▶ Morphological analysers
  ▶ are turned into spellcheckers for LibreOffice and MS Office
  ▶ There are two North American spellcheckers
▶ + dictionaries
  ▶ gives click-in-text e-dictionaries
▶ + lexical selection and transfer rules
  ▶ gives machine translation
▶ Also: Keyboards for all platforms
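
How an analyser can double as a spellchecker can be sketched as follows. This is a toy under stated assumptions, not how the Giella spellers are actually built. The dictionary `TOY_ANALYSER` stands in for a real morphological analyser, and the tag strings and the edit-distance-1 suggestion strategy are illustrative choices.

```python
# Stand-in for a morphological analyser: word form -> list of analyses.
TOY_ANALYSER = {
    "gussa": ["gussa+N+Sg+Nom"],
    "gusa": ["gussa+N+Sg+Gen", "gussa+N+Sg+Acc"],
}
ALPHABET = "agsu"  # just enough letters for the toy lexicon


def analyse(word):
    return TOY_ANALYSER.get(word, [])


def is_correct(word):
    """A word form is accepted if the analyser gives it at least one analysis."""
    return bool(analyse(word))


def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def suggest(word):
    """Suggest every accepted word form within one edit of a misspelling."""
    return sorted(c for c in edits1(word) if is_correct(c))


if __name__ == "__main__":
    print(is_correct("gusa"), is_correct("guussa"))
    print(suggest("guussa"))
```

The design point is that the speller needs no word list of its own. Whatever the analyser accepts, including inflected forms that never occurred in any corpus, is a valid word.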

Compilation is standardised

▶ All this with one command:
  make APPLICATION for LANGUAGE

Tools

▶ http://gtweb.uit.no/korp
▶ http://sanit.oahpa.no
▶ http://giellatekno.uit.no/doc/infra/GettingStarted.html

Figure: The S curve

Rule-based machine translation

▶ Bar-Hillel’s critique was flawed (and unfair towards Google Translate)
  ▶ His example was not authentic
  ▶ and the purpose of the MT program he envisaged was not stated
▶ One important point: Rule-based systems may correct errors
▶ Another point: We may get efficient text production systems between closely related languages

The Apertium languages A-K

Aekyom, Afrikaans, Albanian, Arabic, Aragonese, Armenian, Assamese, Asturian, Avaric, Aymara, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bislama, Breton, Bulgarian, Buriat, Catalan, Cebuano, Central Kurdish, Chinese, Chukot, Church Slavic, Chuvash, Corsican, Crimean Tatar, Cusco Quechua, Czech, Danish, Dargwa, Dhivehi, Dolgan, Domung, Dutch, Eastern Apurímac Quechua, Eastern Mari, English, Erzya, Esperanto, Estonian, Evenki, Faroese, Finnish, French, Gagauz, Galician, Ganda, Georgian, German, Gilaki, Guarani, Gujarati, Haitian, Halh Mongolian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Igbo, Inari Sami, Indonesian, Ingush, Interlingua, Interlingua (International Auxiliary Language Association), Iranian Persian, Irish, Italian, Kara-Kalpak, Karachay-Balkar, Karelian, Kashmiri, Kashubian, Kazakh, Khakas, Kirghiz, Komi, Komi-Zyrian, Korean, Kumyk, Kven Finnish

The Apertium languages L-Z

Lao, Latin, Latvian, Lingala, Lithuanian, Liv, Livvi, Lower Sorbian, Luang, Lule Sami, Luxembourgish, Macedo-Romanian, Macedonian, Malay, Malay, Malayalam, Maltese, Manx, Marathi, Mari, Medumba, Modern Greek, Moksha, Morisyen, Nanai, Neapolitan, Nepali, Nogai, Northern Kurdish, Northern Sami, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Occitan, Ossetian, Ottoman Turkish, Panjabi, Peripheral Mongolian, Persian, Polish, Portuguese, Romanian, Romansh, Rundi, Russian, Sanskrit, Sardinian, Scots, Scottish Gaelic, Serbo-Croatian, Sicilian, Sindhi, Sinhala, Slovak, Slovenian, Southern Altai, Southern Sami, Spanish, Spanish Sign Language, Standard Latvian, Swahili, Swati, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Tetum, Thai, Turkish, Turkmen, Tuvinian, Udmurt, Uighur, Ukrainian, Upper Sorbian, Urdu, Uzbek, Vietnamese, Vlax Romani, Võro, Wayuu, Welsh, Western Frisian, Western Mari, Wolaytta, Xhosa, Xibe, Yakut, Yiddish, Yoruba, Zulu

Am I against statistical models?

▶ No.
▶ Admittedly, my motto is “Don’t guess, if you know”
▶ But I imagine you can make better guesses the more you know
▶ So, check whether your language is on the Apertium or Giella lists above before you start guessing

The place for statistical models

▶ Language is complex, and many facets are learned via big data
  (caveat: not that big for minority languages)
▶ So, in a way, I suggest “business as usual”,
  but with a sounder foundation

Don’t guess if you know?

▶ A remark apropos of the forthcoming shared task
▶ There is no need to look for the morphemes in the languages of the shared task
  ▶ All of them have been analysed already (open source)
    and we even distinguish between their different grammatical interpretations
▶ So I really would like to see what you could achieve standing on our shoulders

Conclusion

▶ The challenge posed by morphologically rich languages has been solved
▶ The fact that the solution isn’t fashionable at the moment should not prevent us from making use of it
▶ More fashionable approaches are welcome to take it from there