Apertium is a free and open source MT platform, where both the linguistic data and engines are under free licences. Constraint Grammar is used for pre-disambiguation in several language pairs.
CG in Apertium
Kevin Brubeck Unhammer
University of Bergen, Norway
14th May 2009
What is Apertium?
- An Open Source Machine Translation platform
  - both source code and data have Free / Open Source licences
- Modular
  - stand-alone programs communicate through standard Unix pipes
  - particular language pairs need not use all modules!
- Developed by universities, companies and independent (volunteer and paid) developers
History of Apertium
- Initially developed for closely related languages (Portuguese ↔ Spanish ↔ Catalan) by the Transducens group at the Universitat d’Alacant
- Later extended to allow more distant language pairs
- Now also involves various companies in Spain, the universities of Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.
Language pairs
- “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔ Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔ Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔ Spanish, English → Esperanto, Spanish ↔ Galician, French ↔ Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ← Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician, Basque → Spanish
- Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔ English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, ...)
[Figure labels: Marginalised; Few free resources; Copious free resources]
Modules
- Morphological dictionaries
  - lttoolbox: XML format, compiles to FSTs
  - Fast (seems to perform 5x faster than SFST)
  - one dictionary gives both analysis and generation
- CG pre-disambiguation
- Statistical disambiguation (HMM)
- Bilingual dictionary for lexical transfer
- Shallow syntactic transfer rules
  - Local re-ordering (nom adj → adj nom)
  - Chunking (adj adj nom → SN[adj adj nom])
  - Insertions, deletions and substitutions of lexical units and chunks
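The local re-ordering and chunking operations listed above can be illustrated with a toy sketch. This is not Apertium's transfer rule format, which works on the stream format; the `(lemma, tag)` token tuples and the function names `reorder_nom_adj` and `chunk_noun_phrases` are assumptions made only for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, part-of-speech tag); hypothetical representation


def reorder_nom_adj(tokens: List[Token]) -> List[Token]:
    """Swap every adjacent 'nom adj' pair into 'adj nom', scanning left to right."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i][1] == "nom" and tokens[i + 1][1] == "adj"):
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def chunk_noun_phrases(tokens: List[Token]) -> list:
    """Group each run 'adj* nom' into an ('SN', [tokens]) chunk."""
    out = []
    pending: List[Token] = []  # adjectives seen since the last boundary
    for tok in tokens:
        if tok[1] == "adj":
            pending.append(tok)
        elif tok[1] == "nom":
            out.append(("SN", pending + [tok]))
            pending = []
        else:
            out.extend(pending)  # adjectives not followed by a noun stay loose
            pending = []
            out.append(tok)
    out.extend(pending)
    return out


if __name__ == "__main__":
    print(reorder_nom_adj([("casa", "nom"), ("blanco", "adj")]))
    print(chunk_noun_phrases([("big", "adj"), ("old", "adj"), ("house", "nom")]))
```

Real transfer rules match on tag patterns rather than a single tag per token, so this only shows the shape of the two operations.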
A sketch of the architecture
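The modular design described earlier, with stand-alone programs that pass text to each other through pipes, can be sketched as follows. This is a minimal illustration under stated assumptions: `run_pipeline` and `py_stage` are hypothetical helpers, and the two stages are stand-in Python one-liners so that the sketch runs without any Apertium data installed. In a real installation each stage would be one of the module programs with its compiled data files.

```python
import subprocess
import sys
from typing import List, Sequence


def run_pipeline(text: str, stages: Sequence[List[str]]) -> str:
    """Feed `text` through each command in turn, like `a | b | c` in a shell."""
    data = text.encode("utf-8")
    for argv in stages:
        # check=True raises CalledProcessError if a stage exits non-zero.
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")


def py_stage(code: str) -> List[str]:
    """Build a stand-in stage: a Python one-liner reading stdin, writing stdout."""
    return [sys.executable, "-c", code]


# Placeholder stages (assumptions, not Apertium modules).
UPPER = py_stage("import sys; sys.stdout.write(sys.stdin.read().upper())")
REVERSE = py_stage("import sys; sys.stdout.write(sys.stdin.read()[::-1])")

if __name__ == "__main__":
    print(run_pipeline("lese en", [UPPER, REVERSE]))
```

Because every stage only reads standard input and writes standard output, a language pair can leave a stage out of the list without the others noticing.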
The Apertium Stream Format
- Simple example from Norwegian Bokmål
  - “lese en” (‘read a/one’)
  - Morphological analysis gives:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$
  - After CG:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$
- Formatting information (like HTML tags) is saved in superblanks, making document and web translation easy
  - original:
Kva er det du <em>seier</em>?
  - deformatted:
Kva er det du[ <em>]seier[<\/em>]?
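The superblank idea can be sketched in a few lines. This is an illustration that only reproduces the example above, not the behaviour of Apertium's deformatters: the names `deformat` and `reformat` are hypothetical, only `/` is escaped, and text that already contains square brackets is not handled.

```python
import re

# An inline tag together with any whitespace directly around it.
TAG = re.compile(r"\s*<[^>]+>\s*")
# A superblank: '[' ... ']' where backslash escapes the next character.
SUPERBLANK = re.compile(r"\[((?:\\.|[^\]\\])*)\]")


def deformat(html: str) -> str:
    """Wrap each tag (and adjacent whitespace) in [ ], escaping '/'."""
    def wrap(match):
        return "[" + match.group(0).replace("/", "\\/") + "]"
    return TAG.sub(wrap, html)


def reformat(stream: str) -> str:
    """Undo deformat(): drop the brackets and remove the backslash escapes."""
    def unwrap(match):
        return re.sub(r"\\(.)", r"\1", match.group(1))
    return SUPERBLANK.sub(unwrap, stream)


if __name__ == "__main__":
    original = "Kva er det du <em>seier</em>?"
    hidden = deformat(original)
    print(hidden)
    print(reformat(hidden) == original)
```

Modules in the middle of the pipeline can then treat anything in square brackets as opaque blank material and pass it through unchanged.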
Visualising the process helps find errors
The platform provides
- a language-independent machine translation engine
- tools to manage the linguistic data necessary to build a machine translation system for a given language pair
  - little programming knowledge required to get started
  - graphical user interfaces that show each step in the translation process
  - many more advanced tools (e.g. for merging or sorting dictionaries)
- linguistic data for a growing number of language pairs
- also usable for other NLP purposes (spelling & grammar checking, ...)
CG in Apertium
- Used after morphological analysis for pre-disambiguation in Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔ Scottish Gaelic
- Apertium’s own statistical disambiguator makes a choice if CG doesn’t completely disambiguate
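The division of labour between CG and the statistical disambiguator can be sketched with the “lese en” example from the stream format slide. Everything here is a simplified assumption: the tuple representation, the toy constraint (in CG notation roughly `REMOVE (imp) IF (-1 (inf))`, which is not taken from any of the grammars mentioned), and `pick_first`, which merely stands in for the HMM tagger.

```python
from typing import List, Tuple

Reading = Tuple[str, Tuple[str, ...]]   # (lemma, tags)
Cohort = Tuple[str, List[Reading]]      # (surface form, readings)


def remove_readings(sentence: List[Cohort], should_remove) -> List[Cohort]:
    """Apply one REMOVE-style constraint to every cohort.

    If the constraint would remove every reading of a cohort,
    that cohort is left untouched.
    """
    result = []
    for i, (surface, readings) in enumerate(sentence):
        kept = [r for r in readings if not should_remove(sentence, i, r)]
        result.append((surface, kept if kept else readings))
    return result


def imperative_after_infinitive(sentence, i, reading) -> bool:
    """Toy constraint: drop an imperative right after an unambiguous infinitive."""
    if i == 0 or "imp" not in reading[1]:
        return False
    previous = sentence[i - 1][1]
    return len(previous) == 1 and "inf" in previous[0][1]


def pick_first(sentence: List[Cohort]) -> List[Cohort]:
    """Placeholder for the statistical tagger: keep the first remaining reading."""
    return [(surface, readings[:1]) for surface, readings in sentence]


SENTENCE = [
    ("lese", [("lese", ("vblex", "inf"))]),
    ("en", [("en", ("num", "sg", "mf")),
            ("ene", ("vblex", "imp")),
            ("en", ("det", "ind", "mf", "sg"))]),
]

if __name__ == "__main__":
    after_cg = remove_readings(SENTENCE, imperative_after_infinitive)
    print(after_cg)              # "en" keeps two readings, as on the slide
    print(pick_first(after_cg))  # the fallback leaves exactly one reading
```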
CG in Apertium
- Norwegian CG is from the Oslo-Bergen Tagger (GPL)
- Sámi giellatekno provides Free grammars for Sámi languages and Faroese
- Irish grammar mostly converted manually from the An Gramadóir project (GPL)
- Other grammars made solely by Apertium members
Some statistics
         Sections  Rules  Sets  Tags
Welsh    2         98     141   128
Breton   4         121    125   154
Irish    1         285    298   292
Table: Rule counts for some of the CG grammars in Apertium
Same concepts apply between modules
CG        Apertium/lttoolbox      Apertium stream format
wordform  surface form            books
baseform  lemma                   book
cohort    ambiguous lexical unit  ^books/book<n><pl>/book<vblex><pres><p3><sg>$
reading   analysis                /book<n><pl>/
Table: Terminology differences
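The correspondence in the table can be made concrete with a small parser that turns stream-format text into CG-style cohorts and readings. This is a minimal sketch: the class and function names are hypothetical, and escaped characters and superblanks are ignored, which a real reader of the format would have to handle.

```python
import re
from dataclasses import dataclass, field
from typing import List

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ unit
TAG = re.compile(r"<([^>]+)>")            # one <tag>


@dataclass
class Reading:
    baseform: str        # "lemma" in Apertium terms
    tags: List[str]


@dataclass
class Cohort:
    wordform: str        # "surface form" in Apertium terms
    readings: List[Reading] = field(default_factory=list)


def parse_stream(stream: str) -> List[Cohort]:
    """Split stream-format text into cohorts, one per lexical unit."""
    cohorts = []
    for match in LEXICAL_UNIT.finditer(stream):
        wordform, *analyses = match.group(1).split("/")
        cohort = Cohort(wordform)
        for analysis in analyses:
            baseform = analysis.split("<", 1)[0]
            cohort.readings.append(Reading(baseform, TAG.findall(analysis)))
        cohorts.append(cohort)
    return cohorts


if __name__ == "__main__":
    for cohort in parse_stream("^books/book<n><pl>/book<vblex><pres><p3><sg>$"):
        print(cohort.wordform, [(r.baseform, r.tags) for r in cohort.readings])
```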
Same format readable by all modules
- Both SFST/HFST and vislcg3 read and write the Apertium stream format.
- Example from the Open Morphology of Finnish, output by the Apertium reader in SFST/HFST:
^kaikki/kaikki<noun><7><a><sg><nom>$
^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$
^ja/*ja$
^tasavertaisina/*tasavertaisina$
^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$
Why Apertium
- Rule-based MT
  - most languages of the world have little freely available textual data, let alone parallel corpora for SMT purposes; Apertium is thus suitable for marginalised languages
  - Rule-based systems are linguistically interesting, and provide testbeds for linguistic theory
- Reuse and Interoperability
  - Monolingual dictionaries and constraint grammars are directly reusable for new language pairs
  - apertium-dixtools: generates new language pairs from existing ones
  - vislcg3 reads and outputs the Apertium stream format, as do Stuttgart/Helsinki Finite State Tools
  - Free licences allow other systems to use Apertium data and tools
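One way to picture generating a new language pair from existing ones is to pivot two bilingual word lists through their shared language. The sketch below shows only that idea and is not claimed to be how apertium-dixtools works, since real dictionaries are XML files whose entries carry tags and restrictions. The function name `cross` and the tiny word lists are illustrative assumptions.

```python
from typing import Dict, List


def cross(a_to_b: Dict[str, List[str]],
          b_to_c: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Compose A->B and B->C word lists into an A->C word list."""
    a_to_c: Dict[str, List[str]] = {}
    for a_word, b_words in a_to_b.items():
        targets: List[str] = []
        for b_word in b_words:
            for c_word in b_to_c.get(b_word, []):
                if c_word not in targets:   # keep order, drop duplicates
                    targets.append(c_word)
        if targets:                         # words with no pivot match are skipped
            a_to_c[a_word] = targets
    return a_to_c


# Illustrative entries only: Spanish -> Catalan and Catalan -> Portuguese.
ES_CA = {"perro": ["gos"], "gato": ["gat"], "casa": ["casa"]}
CA_PT = {"gos": ["cão"], "gat": ["gato"]}

if __name__ == "__main__":
    print(cross(ES_CA, CA_PT))
```

Entries that fall out of the crossing, like `casa` here, show where the generated pair would still need manual work.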
Why Apertium
- Open Source + fairly simple learning curve = great potential for contributors
  - E.g. Jacob Nordfalk: entered Apertium last fall, had English → Esperanto pair by March 2009
- Very helpful and accessible community
Future work: dependency-based reordering in Apertium
- Currently, CG is only used for disambiguation
- Many constraint grammars out there give dependency information; this could be integrated into Apertium to provide dependency-based reordering, simplifying the transfer step
Future work: integration with Matxin
- Matxin is a Free Software sister project of Apertium which currently uses FreeLing for dependency analyses:
<SENTENCE ord='1'>
<CHUNK ord='2' type='grup-verb' si='top'>
  <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
  <CHUNK ord='1' type='sn' si='subj'>
    <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
      <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
      <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' type='sn' si='obj'>
    <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
  </CHUNK>
  <CHUNK ord='4' type='F-term' si='modnomatch'>
    <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
Future work: integration with Matxin
- We would like to get CG dependency information into a Matxin-compatible format.
- Apertium’s CG would handle analysis while Matxin handles the transfer step. E.g. given the following analysis (Faroese):
"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
Future work: integration with Matxin
- ...we would like to get this dependency tree structure:
<SENTENCE ord="1">
  <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
    <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
      <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
    </NODE>
    <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
    <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
  </NODE>
</SENTENCE>
- and let Matxin do reordering and other transfer operations
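The conversion wished for here, from CG dependency output (`#n->m` relations) to a nested `NODE` tree, can be sketched as follows. This is a rough illustration under assumptions: the function name `cg_to_tree` is hypothetical, only the first reading of each cohort is used, and the attribute values are derived mechanically from the analysis, so they differ in places from the hand-written target tree above. Nothing here is claimed to match the exact format Matxin expects.

```python
import re
import xml.etree.ElementTree as ET

CG_OUTPUT = '''"<Í>"
\t"í" Pr @ADVL> #1->3
"<upphavi>"
\t"upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
\t"skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
\t"gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
\t"himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''

WORDFORM = re.compile(r'^"<(.*)>"$')
READING = re.compile(r'^\s+"([^"]*)"\s+(.*)$')
DEPENDENCY = re.compile(r"^#(\d+)->(\d+)$")


def cg_to_tree(cg_text: str) -> ET.Element:
    """Build a SENTENCE element whose NODEs nest according to #self->parent."""
    nodes = {}    # own position -> NODE element
    parents = {}  # own position -> parent position (0 means root)
    form = None
    for line in cg_text.splitlines():
        wordform = WORDFORM.match(line)
        if wordform:
            form = wordform.group(1)
            continue
        reading = READING.match(line)
        if not reading or form is None:
            continue
        lemma, rest = reading.groups()
        morph, function, position, parent = [], "", None, None
        for tag in rest.split():
            dep = DEPENDENCY.match(tag)
            if dep:
                position, parent = int(dep.group(1)), int(dep.group(2))
            elif tag.startswith("@"):
                function = tag.strip("@<>")   # "@<SUBJ" -> "SUBJ"
            else:
                morph.append(tag)
        if position is None:
            continue
        nodes[position] = ET.Element("NODE", form=form, lem=lemma,
                                     ord=str(position), mi=".".join(morph),
                                     si=function)
        parents[position] = parent
        form = None  # ignore any further readings of this cohort
    sentence = ET.Element("SENTENCE", ord="1")
    for position in sorted(nodes):
        parent = parents[position]
        (sentence if parent == 0 else nodes[parent]).append(nodes[position])
    return sentence


if __name__ == "__main__":
    print(ET.tostring(cg_to_tree(CG_OUTPUT), encoding="unicode"))
```

The sketch assumes every parent position mentioned in a relation also occurs as a cohort; a robust converter would need to check that and guard against cycles.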
Thanks for listening!
Licences
This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.
- GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html
- GNU FDL v. 1.2: http://www.gnu.org/licenses/gfdl.html
- CC-BY-SA v. 3.0: http://creativecommons.org/licenses/by-sa/3.0/