Constraint Grammar and Apertium

CG in Apertium

Kevin Brubeck Unhammer
University of Bergen, Norway

14th May 2009

What is Apertium?

I An Open Source Machine Translation platform
  I both source code and data have Free / Open Source licences

I Modular
  I stand-alone programs communicate through standard Unix pipes
  I particular language pairs need not use all modules!

I Developed by universities, companies and independent (volunteer and paid) developers

History of Apertium

I Initially developed for closely related languages (Portuguese ↔ Spanish ↔ Catalan) by the Transducens group at the Universitat d’Alacant

I Later extended to allow more distant language pairs

I Now also involves various companies in Spain, the universities of Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.

Language pairs

I “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔ Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔ Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔ Spanish, English → Esperanto, Spanish ↔ Galician, French ↔ Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ← Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician, Basque → Spanish

I Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔ English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, ...)

[Figure legend: Marginalised / Few free resources / Copious free resources]

Modules

I Morphological dictionaries
  I lttoolbox: XML format, compiles to FSTs
    I Fast (seems to perform 5x faster than SFST)
    I one dictionary gives both analysis and generation

I CG pre-disambiguation

I Statistical disambiguation (HMM)

I Bilingual dictionary for lexical transfer

I Shallow syntactic transfer rules

  I Local re-ordering (nom adj → adj nom)
  I Chunking (adj adj nom → SN[adj adj nom])
  I Insertions, deletions and substitutions of lexical units and chunks
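The re-ordering and chunking operations listed above can be illustrated with a toy sketch. It is not Apertium's transfer rule format (real rules are written in XML); the `(lemma, tag)` token representation and both function names are assumptions made for illustration only.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, part-of-speech tag); hypothetical representation


def reorder_nom_adj(tokens: List[Token]) -> List[Token]:
    """Swap every adjacent 'nom adj' pair into 'adj nom'."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i][1] == "nom" and tokens[i + 1][1] == "adj":
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def chunk_noun_phrases(tokens: List[Token]) -> List[object]:
    """Group each run of adjectives followed by a noun into a chunk labelled SN."""
    out: List[object] = []
    pending: List[Token] = []  # adjectives waiting for their noun
    for tok in tokens:
        if tok[1] == "adj":
            pending.append(tok)
        elif tok[1] == "nom":
            out.append(("SN", pending + [tok]))
            pending = []
        else:
            out.extend(pending)  # adjectives without a noun stay unchunked
            pending = []
            out.append(tok)
    out.extend(pending)
    return out


reordered = reorder_nom_adj([("casa", "nom"), ("blanca", "adj")])
chunks = chunk_noun_phrases([("big", "adj"), ("white", "adj"), ("house", "nom")])
print(reordered)
print(chunks)
```

Keeping re-ordering and chunking as separate passes mirrors the idea that each transfer operation is a small, local rewrite over the token sequence.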

A sketch of the architecture

The Apertium Stream Format

I Simple example from Norwegian Bokmål
  I “lese en” (‘read a/one’)
I Morphological analysis gives:

^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$

I After CG:

^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$
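To make the two examples above easier to compare, here is a minimal sketch of a reader for this format. It assumes the simple case shown on this slide only: escaped characters and superblanks are not handled, and `parse_stream` is a hypothetical helper, not part of any Apertium tool.

```python
import re
from typing import Dict, List

# One lexical unit looks like ^surface/analysis1/analysis2$
LU_PATTERN = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_stream(text: str) -> List[Dict[str, object]]:
    """Split a stream into lexical units, each with its list of readings."""
    units = []
    for match in LU_PATTERN.finditer(text):
        surface = match.group(1)
        analyses = [a for a in match.group(2).split("/") if a]
        readings = []
        for analysis in analyses:
            lemma = analysis.split("<", 1)[0]
            tags = re.findall(r"<([^>]+)>", analysis)
            readings.append({"lemma": lemma, "tags": tags})
        units.append({"surface": surface, "readings": readings})
    return units


before = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$"
after = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$"

for label, stream in (("before CG", before), ("after CG", after)):
    for unit in parse_stream(stream):
        summary = [r["lemma"] + ":" + ".".join(r["tags"]) for r in unit["readings"]]
        print(label, unit["surface"], summary)
```

Run on the two streams, the only difference is that “en” loses its imperative verb reading (lemma “ene”) after CG, while the numeral and determiner readings remain.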

I Formatting information (like HTML tags) is saved in superblanks, making document and web translation easy

I original:

Kva er det du <em>seier</em>?

I deformatted:

Kva er det du[ <em>]seier[<\/em>]?
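The superblank convention in the deformatted line can be sketched as follows. This is a simplified reading of the example above, assuming that square brackets delimit superblanks and that a backslash escapes the next character; `split_superblanks` and `reformat` are hypothetical names, not Apertium programs.

```python
from typing import List, Tuple


def split_superblanks(deformatted: str) -> List[Tuple[str, str]]:
    """Split a deformatted string into ('text', ...) and ('blank', ...) pieces."""
    pieces: List[Tuple[str, str]] = []
    kind = "text"
    buf: List[str] = []
    i = 0
    while i < len(deformatted):
        ch = deformatted[i]
        if ch == "\\" and i + 1 < len(deformatted):
            buf.append(deformatted[i + 1])  # escaped character, taken literally
            i += 2
            continue
        if ch == "[" and kind == "text":
            if buf:
                pieces.append((kind, "".join(buf)))
            kind, buf = "blank", []
        elif ch == "]" and kind == "blank":
            pieces.append((kind, "".join(buf)))
            kind, buf = "text", []
        else:
            buf.append(ch)
        i += 1
    if buf:
        pieces.append((kind, "".join(buf)))
    return pieces


def reformat(deformatted: str) -> str:
    """Restore the original document by dropping the superblank brackets."""
    return "".join(content for _, content in split_superblanks(deformatted))


deformatted = r"Kva er det du[ <em>]seier[<\/em>]?"
pieces = split_superblanks(deformatted)
print([content for kind, content in pieces if kind == "text"])
print(reformat(deformatted))
```

Only the 'text' pieces would need to be translated; the 'blank' pieces travel through unchanged and are put back at the end.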

Visualising the process helps find errors

The platform provides

I a language-independent machine translation engine

I tools to manage the linguistic data necessary to build a machine translation system for a given language pair
  I little programming knowledge required to get started
  I graphical user interfaces that show each step in the translation process
  I many more advanced tools (e.g. for merging or sorting dictionaries)

I linguistic data for a growing number of language pairs

I also usable for other NLP purposes (spelling & grammar checking, ...)

CG in Apertium

I Used after morphological analysis for pre-disambiguation in Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔ Scottish Gaelic

I Apertium’s own statistical disambiguator makes a choice if CG doesn’t completely disambiguate
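The division of labour between the two disambiguators can be sketched as a two-stage filter. Everything here is illustrative: the single rule is a made-up constraint, not taken from any of the grammars named on these slides, and a plain weight table stands in for the HMM tagger.

```python
import re
from typing import Dict, List, Tuple

Cohort = Tuple[str, List[str]]  # (surface form, candidate readings)


def first_tag(reading: str) -> str:
    """Return the first tag of a reading, e.g. 'det' for 'en<det><ind>'."""
    match = re.search(r"<([^>]+)>", reading)
    return match.group(1) if match else ""


def rule_stage(cohorts: List[Cohort]) -> List[Cohort]:
    """Toy constraint: drop imperative readings right after an unambiguous infinitive."""
    result: List[Cohort] = []
    for index, (surface, readings) in enumerate(cohorts):
        kept = readings
        if index > 0:
            previous = cohorts[index - 1][1]
            if len(previous) == 1 and "<inf>" in previous[0]:
                filtered = [r for r in readings if "<imp>" not in r]
                kept = filtered or readings  # never remove the last reading
        result.append((surface, kept))
    return result


def fallback_stage(cohorts: List[Cohort], tag_weight: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pick exactly one reading per cohort; the weight table stands in for the HMM."""
    return [
        (surface, max(readings, key=lambda r: tag_weight.get(first_tag(r), 0.0)))
        for surface, readings in cohorts
    ]


cohorts = [
    ("lese", ["lese<vblex><inf>"]),
    ("en", ["en<num><sg><mf>", "ene<vblex><imp>", "en<det><ind><mf><sg>"]),
]
weights = {"det": 0.6, "num": 0.3, "vblex": 0.1}  # invented numbers, for illustration only

after_rules = rule_stage(cohorts)
chosen = fallback_stage(after_rules, weights)
print(after_rules)
print(chosen)
```

The rule stage only ever removes readings it is sure about and may leave ambiguity behind; the fallback stage always commits to a single reading.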

CG in Apertium

I Norwegian CG is from the Oslo-Bergen Tagger (GPL)

I Sámi giellatekno provides Free grammars for Sámi languages and Faroese

I Irish grammar mostly converted manually from the An Gramadóir project (GPL)

I Other grammars made solely by Apertium members

Some statistics

         Sections  Rules  Sets  Tags
Welsh    2         98     141   128
Breton   4         121    125   154
Irish    1         285    298   292

Table: Rule counts for some of the CG grammars in Apertium

Same concepts apply between modules

CG        Apertium/lttoolbox      Apertium stream format
wordform  surface form            books
baseform  lemma                   book
cohort    ambiguous lexical unit  ^books/book<n><pl>/book<vblex><pres><p3><sg>$
reading   analysis                /book<n><pl>/

Table: Terminology differences
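The correspondence in the table can be made concrete by converting one lexical unit from the Apertium stream format into CG-style cohort text (wordform in `"<...>"`, one tab-indented reading per line). This is a minimal sketch under the assumption that no escaped characters occur; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    baseform: str    # "lemma" in Apertium terms
    tags: List[str]


@dataclass
class Cohort:
    wordform: str    # "surface form" in Apertium terms
    readings: List[Reading]


def cohort_from_lexical_unit(lexical_unit: str) -> Cohort:
    """Turn ^surface/analysis/...$ into a Cohort with one Reading per analysis."""
    body = lexical_unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^surface/analysis$")
    surface, *analyses = body[1:-1].split("/")
    readings = [
        Reading(baseform=a.split("<", 1)[0], tags=re.findall(r"<([^>]+)>", a))
        for a in analyses
    ]
    return Cohort(wordform=surface, readings=readings)


def to_cg_text(cohort: Cohort) -> str:
    """Render a cohort in CG text form."""
    lines = ['"<{}>"'.format(cohort.wordform)]
    for reading in cohort.readings:
        line = '\t"{}" {}'.format(reading.baseform, " ".join(reading.tags))
        lines.append(line.rstrip())
    return "\n".join(lines)


cohort = cohort_from_lexical_unit("^books/book<n><pl>/book<vblex><pres><p3><sg>$")
print(to_cg_text(cohort))
```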

Same format readable by all modules

I Both SFST/HFST and vislcg3 read and write the Apertium stream format.

I Example from the Open Morphology of Finnish, output by the Apertium reader in SFST/HFST:

^kaikki/kaikki<noun><7><a><sg><nom>$
^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$
^ja/*ja$
^tasavertaisina/*tasavertaisina$
^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$

Why Apertium

I Rule-based MT

I most languages of the world have little freely available textual data, let alone parallel corpora for SMT purposes; Apertium is thus suitable for marginalised languages

I Rule-based systems are linguistically interesting, and provide testbeds for linguistic theory

I Reuse and Interoperability

I Monolingual dictionaries and constraint grammars are directly reusable for new language pairs

I apertium-dixtools: generates new language pairs from existing ones

I vislcg3 reads and outputs the Apertium stream format, as do Stuttgart/Helsinki Finite State Tools

I Free licences allow other systems to use Apertium data and tools

Why Apertium

I Open Source + fairly simple learning curve = great potential for contributors

I E.g. Jacob Nordfalk: entered Apertium last fall, had English → Esperanto pair by March 2009

I Very helpful and accessible community

Future work: dependency-based reordering in Apertium

I Currently, CG is only used for disambiguation

I Many constraint grammars out there give dependency information; this could be integrated into Apertium to provide dependency-based reordering, simplifying the transfer step

Future work: integration with Matxin

I Matxin is a Free Software sister project of Apertium which currently uses FreeLing for dependency analyses:

</NODE>
</CHUNK>
<CHUNK ord='3' type='sn' si='obj'>
  <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
</CHUNK>
<CHUNK ord='4' type='F-term' si='modnomatch'>
  <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

Future work: integration with Matxin

I We would like to get CG dependency information into a Matxin-compatible format.

I Apertium’s CG would handle analysis while Matxin handles the transfer step. E.g. given the following analysis (Faroese):

"<Í>"
	"í" Pr @ADVL> #1->3
"<upphavi>"
	"upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
	"skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
	"gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
	"himmal" N Msc Sg Acc Indef @<OBJ #5->3

Future work: integration with Matxin

I ...we would like to get this dependency tree structure:

</NODE></SENTENCE>

I and let Matxin do reordering and other transfer operations
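As a rough illustration of the first step, the `#self->parent` markers in the Faroese analysis can be read into a tree. This is only a sketch: it prints an indented outline, not Matxin's XML format, it takes the first reading of each cohort, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# The Faroese analysis from the slide, one wordform line plus one reading line per cohort.
CG_OUTPUT = "\n".join([
    '"<Í>"',
    '\t"í" Pr @ADVL> #1->3',
    '"<upphavi>"',
    '\t"upphav" N Neu Sg Dat Indef @P< #2->1',
    '"<skapti>"',
    '\t"skapa" V Ind Prt Sg @VMAIN #3->0',
    '"<Gud>"',
    '\t"gudur" N Msc Sg Acc Indef @<SUBJ #4->3',
    '"<himmal>"',
    '\t"himmal" N Msc Sg Acc Indef @<OBJ #5->3',
])

WORDFORM = re.compile(r'^"<(.*)>"\s*$')
DEPENDENCY = re.compile(r"#(\d+)->(\d+)")


def parse_dependencies(text: str) -> List[Tuple[int, int, str]]:
    """Return (self, parent, wordform) triples, one per cohort."""
    triples = []
    wordform = None
    for line in text.splitlines():
        header = WORDFORM.match(line)
        if header:
            wordform = header.group(1)
            continue
        relation = DEPENDENCY.search(line)
        if relation and wordform is not None:
            triples.append((int(relation.group(1)), int(relation.group(2)), wordform))
            wordform = None  # use only the first reading of each cohort
    return triples


def build_children(triples: List[Tuple[int, int, str]]) -> Dict[int, List[int]]:
    """Map each parent id to the ids of its dependents (0 is the root)."""
    children: Dict[int, List[int]] = defaultdict(list)
    for node, parent, _ in triples:
        children[parent].append(node)
    return dict(children)


def render(forms: Dict[int, str], children: Dict[int, List[int]], node: int = 0, depth: int = 0) -> List[str]:
    """Produce an indented outline of the tree below the given node."""
    lines: List[str] = []
    for child in children.get(node, []):
        lines.append("  " * depth + forms[child])
        lines.extend(render(forms, children, child, depth + 1))
    return lines


triples = parse_dependencies(CG_OUTPUT)
children = build_children(triples)
forms = {node: wordform for node, _, wordform in triples}
outline = render(forms, children)
print("\n".join(outline))
```

With the tree in hand, a transfer step could reorder the dependents of each node instead of matching fixed-length token patterns.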

Thanks for listening!

Licences

This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.

I GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html

I GNU FDL v. 1.2: http://www.gnu.org/licenses/gfdl.html

I CC-BY-SA v. 3.0: http://creativecommons.org/licenses/by-sa/3.0/
