CG in Apertium Kevin Brubeck Unhammer University of Bergen, Norway 14th May 2009

Constraint Grammar and Apertium


DESCRIPTION

Apertium is a free and open source MT platform, where both the linguistic data and engines are under free licences. Constraint Grammar is used for pre-disambiguation in several language pairs.


Page 1: Constraint Grammar and Apertium

CG in Apertium

Kevin Brubeck Unhammer
University of Bergen, Norway

14th May 2009

Page 2: Constraint Grammar and Apertium

What is Apertium?

- An Open Source Machine Translation platform
  - both source code and data have Free / Open Source licences

- Modular
  - stand-alone programs communicate through standard Unix pipes
  - particular language pairs need not use all modules!

- Developed by universities, companies and independent (volunteer and paid) developers

Page 3: Constraint Grammar and Apertium

History of Apertium

- Initially developed for closely related languages (Portuguese ↔ Spanish ↔ Catalan) by the Transducens group at the Universitat d’Alacant

- Later extended to allow more distant language pairs

- Now also involves various companies in Spain, the universities of Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.

Page 4: Constraint Grammar and Apertium

Language pairs

- “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔ Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔ Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔ Spanish, English → Esperanto, Spanish ↔ Galician, French ↔ Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ← Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician, Basque → Spanish

- Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔ English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, ...)

Page 5: Constraint Grammar and Apertium

[Figure labels: Marginalised; Few free resources; Copious free resources]

Page 6: Constraint Grammar and Apertium

Modules

- Morphological dictionaries
  - lttoolbox: XML format, compiles to FSTs
    - Fast (seems to perform 5x faster than SFST)
    - one dictionary gives both analysis and generation

- CG pre-disambiguation

- Statistical disambiguation (HMM)

- Bilingual dictionary for lexical transfer

- Shallow syntactic transfer rules
  - Local re-ordering (nom adj → adj nom)
  - Chunking (adj adj nom → SN[adj adj nom])
  - Insertions, deletions and substitutions of lexical units and chunks
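The local re-ordering step (nom adj → adj nom) can be illustrated with a toy sketch. This is not how Apertium implements transfer (real rules live in the pair's transfer rule files); the function name `reorder_nom_adj` and the `(lemma, tag)` tuples are assumptions made only for this illustration.

```python
def reorder_nom_adj(units):
    """Swap every noun that is immediately followed by an adjective.

    `units` is a list of (lemma, part_of_speech) tuples; the tag names
    "n" and "adj" are illustrative choices for this sketch.
    """
    out = list(units)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "n" and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


# Spanish-like word order in, English-like word order out
units = [("el", "det"), ("gato", "n"), ("negro", "adj")]
reordered = reorder_nom_adj(units)
print(reordered)
```

A real rule would also handle agreement and lexical transfer; the sketch only shows the pattern-then-action shape of a shallow rule.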

Page 7: Constraint Grammar and Apertium

A sketch of the architecture

Page 8: Constraint Grammar and Apertium

The Apertium Stream Format

- Simple example from Norwegian Bokmål
  - “lese en” (‘read a/one’)
  - Morphological analysis gives:

^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$

  - After CG:

^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$

- Formatting information (like HTML tags) is saved in superblanks, making document and web translation easy
  - original:

Kva er det du <em>seier</em>?

  - deformatted:

Kva er det du[ <em>]seier[<\/em>]?
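To make the stream format concrete, here is a minimal Python sketch that splits a stream into lexical units and reports which analyses CG removed in the “lese en” example. The regular expression and the function names `parse_stream` and `removed_readings` are assumptions for this illustration; the sketch ignores escaped characters and superblanks, which a real reader would have to handle.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_stream(text):
    """Return a list of (surface_form, [analyses]) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface = match.group(1)
        analyses = [a for a in match.group(2).split("/") if a]
        units.append((surface, analyses))
    return units


def removed_readings(before, after):
    """List the analyses present in `before` but missing in `after`."""
    removed = []
    for (surface, old), (_, new) in zip(parse_stream(before), parse_stream(after)):
        dropped = [a for a in old if a not in new]
        if dropped:
            removed.append((surface, dropped))
    return removed


before = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$"
after = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$"

print(removed_readings(before, after))
# prints [('en', ['ene<vblex><imp>'])]
```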

Page 9: Constraint Grammar and Apertium

Visualising the process helps find errors

Page 10: Constraint Grammar and Apertium

The platform provides

- a language-independent machine translation engine

- tools to manage the linguistic data necessary to build a machine translation system for a given language pair
  - little programming knowledge required to get started
  - graphical user interfaces that show each step in the translation process
  - many more advanced tools (e.g. for merging or sorting dictionaries)

- linguistic data for a growing number of language pairs

- also usable for other NLP purposes (spelling & grammar checking, ...)

Page 11: Constraint Grammar and Apertium

CG in Apertium

- Used after morphological analysis for pre-disambiguation in Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔ Scottish Gaelic

- Apertium’s own statistical disambiguator makes a choice if CG doesn’t completely disambiguate
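The division of labour between CG and the statistical disambiguator can be sketched as follows. Both functions are toy stand-ins: `remove_readings` mimics a single context-free removal rather than a real CG rule, and `pick_fallback` uses a made-up preference table rather than Apertium's HMM tagger. All names and scores here are assumptions for illustration only.

```python
import re

TAG = re.compile(r"<([^>]+)>")


def tags(reading):
    """Return the tags of a reading such as 'en<det><ind><mf><sg>'."""
    return TAG.findall(reading)


def remove_readings(readings, banned_tag):
    """Toy stand-in for a CG removal rule: drop readings carrying
    `banned_tag`, but never remove the last remaining reading."""
    kept = [r for r in readings if banned_tag not in tags(r)]
    return kept or readings


def pick_fallback(readings, preference):
    """Toy stand-in for the statistical tagger: choose the reading
    whose first tag has the highest (made-up) preference score."""
    return max(readings, key=lambda r: preference.get(tags(r)[0], 0.0))


readings = ["en<num><sg><mf>", "ene<vblex><imp>", "en<det><ind><mf><sg>"]
after_cg = remove_readings(readings, "imp")  # CG leaves two readings

preference = {"det": 0.7, "num": 0.3}  # hypothetical scores, not real data
chosen = after_cg[0] if len(after_cg) == 1 else pick_fallback(after_cg, preference)
print(chosen)
```

The point of the sketch is the control flow: the fallback only runs when more than one reading survives the rule-based step.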

Page 12: Constraint Grammar and Apertium

CG in Apertium

- Norwegian CG is from the Oslo-Bergen Tagger (GPL)

- Sámi giellatekno provides Free grammars for Sámi languages and Faroese

- Irish grammar mostly converted manually from the An Gramadóir project (GPL)

- Other grammars made solely by Apertium members

Page 13: Constraint Grammar and Apertium

Some statistics

          Sections  Rules  Sets  Tags
Welsh            2     98   141   128
Breton           4    121   125   154
Irish            1    285   298   292

Table: Rule counts for some of the CG grammars in Apertium

Page 14: Constraint Grammar and Apertium

Same concepts apply between modules

CG        Apertium/lttoolbox      Apertium stream format
wordform  surface form            books
baseform  lemma                   book
cohort    ambiguous lexical unit  ^books/book<n><pl>/book<vblex><pres><p3><sg>$
reading   analysis                /book<n><pl>/

Table: Terminology differences

Page 15: Constraint Grammar and Apertium

Same format readable by all modules

- Both SFST/HFST and vislcg3 read and write the Apertium stream format.

- Example from the Open Morphology of Finnish, output by the Apertium reader in SFST/HFST:

^kaikki/kaikki<noun><7><a><sg><nom>$
^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$
^ja/*ja$
^tasavertaisina/*tasavertaisina$
^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$

Page 16: Constraint Grammar and Apertium

Why Apertium

- Rule-based MT
  - most languages of the world have little freely available textual data, let alone parallel corpora for SMT purposes; Apertium is thus suitable for marginalised languages
  - Rule-based systems are linguistically interesting, and provide test beds for linguistic theory

- Reuse and Interoperability
  - Monolingual dictionaries and constraint grammars are directly reusable for new language pairs
  - apertium-dixtools: generates new language pairs from existing ones
  - vislcg3 reads and outputs the Apertium stream format, as do the Stuttgart/Helsinki Finite State Tools
  - Free licences allow other systems to use Apertium data and tools

Page 17: Constraint Grammar and Apertium

Why Apertium

- Open Source + fairly simple learning curve = great potential for contributors
  - E.g. Jacob Nordfalk: entered Apertium last fall, had English → Esperanto pair by March 2009

- Very helpful and accessible community

Page 18: Constraint Grammar and Apertium

Future work: dependency-based reordering in Apertium

- Currently, CG is only used for disambiguation

- Many constraint grammars out there give dependency information; this could be integrated into Apertium to provide dependency-based reordering, simplifying the transfer step

Page 19: Constraint Grammar and Apertium

Future Work: integration with Matxin

- Matxin is a Free Software sister project of Apertium which currently uses FreeLing for dependency analyses:

<SENTENCE ord='1'>
  <CHUNK ord='2' type='grup-verb' si='top'>
    <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
    <CHUNK ord='1' type='sn' si='subj'>
      <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
        <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
        <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
      </NODE>
    </CHUNK>
    <CHUNK ord='3' type='sn' si='obj'>
      <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
    </CHUNK>
    <CHUNK ord='4' type='F-term' si='modnomatch'>
      <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

Page 20: Constraint Grammar and Apertium

Future work: integration with Matxin

- We would like to get CG dependency information into a Matxin-compatible format.

- Apertium’s CG would handle analysis while Matxin handles the transfer step. E.g. given the following analysis (Faroese):

"<Í>"
	"í" Pr @ADVL> #1->3
"<upphavi>"
	"upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
	"skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
	"gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
	"himmal" N Msc Sg Acc Indef @<OBJ #5->3

Page 21: Constraint Grammar and Apertium

Future work: integration with Matxin

- ...we would like to get this dependency tree structure:

<SENTENCE ord='1'>
  <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
    <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
      <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
    </NODE>
    <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
    <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
  </NODE>
</SENTENCE>

- and let Matxin do reordering and other transfer operations
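One way such a conversion might look is sketched below in Python. It assumes CG output with one wordform line followed by one tab-indented reading line per cohort, ending in a `#child->parent` dependency marker; the function names `parse_cg` and `to_tree` are hypothetical, and this is not an existing Apertium or Matxin tool. Attribute values are copied from the analysis as given, so the lemma and tags for some words come out slightly different from the target tree on the slide.

```python
import re
import xml.etree.ElementTree as ET

WORDFORM = re.compile(r'^"<(.*)>"$')
READING = re.compile(r'^\s+"([^"]*)"\s+(.*?)\s+#(\d+)->(\d+)\s*$')


def parse_cg(text):
    """Read fully disambiguated CG output into a list of node dicts."""
    nodes, form = [], None
    for line in text.splitlines():
        wordform = WORDFORM.match(line)
        if wordform:
            form = wordform.group(1)
            continue
        reading = READING.match(line)
        if reading and form is not None:
            lemma, rest, position, head = reading.groups()
            parts = rest.split()
            funcs = [p.strip("@<>") for p in parts if p.startswith("@")]
            nodes.append({
                "form": form,
                "lem": lemma,
                "ord": int(position),
                "head": int(head),
                "mi": ".".join(p for p in parts if not p.startswith("@")),
                "si": funcs[0] if funcs else "",
            })
            form = None
    return nodes


def to_tree(nodes):
    """Nest NODE elements under their heads; head 0 attaches to SENTENCE."""
    sentence = ET.Element("SENTENCE", ord="1")
    elements = {
        n["ord"]: ET.Element("NODE", form=n["form"], lem=n["lem"],
                             ord=str(n["ord"]), mi=n["mi"], si=n["si"])
        for n in nodes
    }
    for n in nodes:
        parent = sentence if n["head"] == 0 else elements[n["head"]]
        parent.append(elements[n["ord"]])
    return sentence


CG_OUTPUT = '''"<Í>"
\t"í" Pr @ADVL> #1->3
"<upphavi>"
\t"upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
\t"skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
\t"gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
\t"himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''

root = to_tree(parse_cg(CG_OUTPUT))
print(ET.tostring(root, encoding="unicode"))
```

The sketch handles only one reading per cohort and does no error checking for missing heads or cycles.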

Page 22: Constraint Grammar and Apertium

Thanks for listening!

Page 23: Constraint Grammar and Apertium

Licences

This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.

- GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html

- GNU FDL v. 1.2: http://www.gnu.org/licenses/gfdl.html

- CC-BY-SA v. 3.0: http://creativecommons.org/licenses/by-sa/3.0/