A roadmap for MT : four « keys »
to handle more languages, for all kinds of tasks,
while making it possible to improve quality (on demand)
International Conference on Universal Knowledge and Language
(ICUKL2002), Goa, 25-29 November 2002
Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53
F-38041 Grenoble cedex 9, [email protected], http://clips.imag.fr/geta
Ch. Boitet ICUKL2002, Goa, 25-29/11/2002 2/30
Outline
• Basic concepts
What is MT?
Goals: Quality / User
Architectures: Vauquois' triangle
• State of the art
MT of texts: examples, problems
MT of spoken dialogs
• The future of MT
Goals
4 keys
What is M(a)T ?
• At least 3 types of automation
MT = Machine Translation
MAT = Machine Assisted Translation
MAHT = Machine Aided Human Translation
• A scientific technology
Informatics (computer science)
Linguistics
Mathematics
Goals: Quality / User
User: linguistically naive / linguistically specialized
Quality: rough, quick / from raw to very good
• MT for access: special fields (atom, chemistry…), general information
• MT for translators: helps (lexicons, proposals from a translation memory…)
• MT for individual authors, with interactive disambiguation
• MT for revisors (posteditors): raw MT, polishable
Architectures: Vauquois' triangle
[Figure: Vauquois' triangle. Labels recovered from the slide:]
Levels: Deep understanding level, Interlingual level, Logico-semantic level, Syntactico-functional level, Syntagmatic level, Morpho-syntactic level, Graphemic level
Structures: Ontological interlingua, Semantico-linguistic interlingua, SPA-structures (semantic & predicate-argument), F-structures (functional), C-structures (constituent), Tagged text, Text
Paths: Direct translation, Semi-direct translation, Syntactic transfer (surface), Syntactic transfer (deep), Semantic transfer, Conceptual transfer, Multilevel transfer, Ascending transfer, Descending transfers
Mixing levels: Multilevel description
Formal intermediate structures
Linguistic level(s): Surface, Deep, 1-level, n-level
Linguistic main organization: Syntagms (constituents), Dependencies, Logical and semantic relations
Geometrical structure: String, Chain graph (chart), Tree structure, Graph / Network, Hypergraph
Algebraic structure: Labels, Struct. string, Boolean features, Structured attributes, Feature structures
Correspondence structure-text: concrete (text ≈ readable from structure), abstract (e.g. UNL)
Scope: Sentence (almost all), Paragraph, Page (Ariane-G5, Sygmart), Document
How to produce an MT system
• Choose an architecture
• Program the "tools"
Specialized languages for linguistic programming (SLLP)
Development environment (MT shell)
• Build the "lingware"
Lexical data / rules / weights
Grammatical data / rules / weights
Possible specialization to a typology ("sublanguage")
• How?
Human work ± computer help / support
Automatic learning (weights, likelihoods…)
State of affairs
• Only a small number of language pairs are covered by MT systems designed for information access
Systran EC (2000): 19/110 language pairs, 8 OK for intended use
See also examples by Ronaldo Martins
• Even fewer are capable of quality translation or speech translation
• Now a few examples…
Examples: MT for access, Web (1)
ENGLISH (human version): The European-Heritage.net thesaurus covers the fields of archaeology and architecture as defined in the Council of Europe conventions signed in Granada (1985) and Malta (1992).
FRENCH (human version): Le thesaurus European-Heritage.net couvre les champs de l'archéologie et de l'architecture au sens des conventions du Conseil de l'Europe de Grenade (1985) et de Malte (1992).
ENGLISH (Systran FRE-ENG version): The European-Heritage.net thesaurus covers the fields of archaeology and architecture within the meaning of conventions of the Council of Europe of Grenade (1985) and Malta (1992).
ENGLISH (human version): It encompasses information ranging from the partners involved, categories of cultural assets and legislation, to activities, skills and funding. It is supplemented by a number of specific thesauruses compiled by each member state on a particular topic, such as the thesaurus on Andalusian heritage or the architectural thesaurus from the Mérimée database in France.
FRENCH (human version): Il prend en compte des aspects aussi variés que les acteurs, les catégories de biens culturels, la législation ou encore les interventions, les métiers et les financements. Il est complété et prolongé par des thesaurus spécifiques développés par chaque Etat membre sur tel ou tel sujet spécifique, comme le thesaurus du patrimoine historique andalou ou le thesaurus d'architecture de la base de données documentaire Mérimée en France.
ENGLISH (Systran FRE-ENG version): It takes into account aspects as varied as the actors, the categories of cultural goods, the legislation or the interventions, the trades and the financings. It is supplemented and prolonged by thesaurus specific developed by each Member State on such or such specific subject, like the thesaurus of the Andalusian historical inheritance or the thesaurus of architecture of the documentation database Mérimée in France.
ENGLISH (human version): This new, open-ended search tool will come on line shortly, together with a management and administration system shared among the various contributors.
FRENCH (human version): Cet instrument de recherche, forcément évolutif, sera mis prochainement en ligne accompagné d'un dispositif de gestion et d'administration réparti entre les différents contributeurs.
ENGLISH (Systran FRE-ENG version): This instrument of search, inevitably evolutionary, will be put soon on line accompanied by a device of management and administration distributed between the various contributors.
Examples: MT for access, Web (2)
• FE quite "easy", compared with EG and mainly FG
GERMAN (Systran ENG-GER version): Der European-Heritage.netthesaurus umfaßt die Felder von archaeology und von Architektur, wie in den Europaratvereinbarungen definiert, die in Granada (1985) unterzeichnet werden und in Malta (1992).
GERMAN (Systran FRE-GER version): Der European-Heritage.net-Thesaurus bedeckt die Felder der Archäologie und der Architektur im Sinne der Übereinkommen des Europarats von Granada (1985) und von Malta (1992).
GERMAN (Systran ENG-GER version): Er gibt die Informationen um, die von den betroffenen Partnern, von den Kategorien der kulturellen Werte und der Gesetzgebung, bis zu Aktivitäten, von den Fähigkeiten und von der Finanzierung reichen. Er wird durch eine Anzahl von den spezifischen Thesauren ergänzt, die durch jeden Mitgliedsstaat auf einem bestimmten Thema, wie dem Thesaurus auf Andalusian Erbe oder dem architektonischen Thesaurus von der Datenbank Mérimée in Frankreich kompiliert werden.
GERMAN (Systran FRE-GER version): Er berücksichtigt Aspekte dermaßen variierte, daß die Beteiligten, die Kategorien kultureller Güter, die Gesetzgebung oder noch die Interventionen, die Berufe und die Finanzierungen. Er wird vervollständigt und wird durch ein spezifische Thesaurus entwickelt durch jeder Mitgliedstaat über das eines oder andere spezifische Thema verlängert, als der Thesaurus des andalusischen historischen Kulturgutes oder der Thesaurus der Architektur der urkundlichen Datenbank Mérimée in Frankreich.
GERMAN (Systran ENG-GER version): Dieses neue, offene Suchhilfsmittel kommt auf Zeile kurz, zusammen mit einem Management- und Leitungssystem, das unter den verschiedenen Mitwirkenden geteilt wird.
GERMAN (Systran FRE-GER version): Dieses notgedrungen entwicklungsfähige Forschungsinstrument wird gestellt demnächst online begleitet von einer Verwaltungs- und Verwaltungsvorrichtung, die aufgeteilt unter den verschiedenen Beitragenden.
Comparison: raw vs rough MT
SpanAm raw Spanish-English output: Message of the Director-General of the World Health Organization
Reverso raw Spanish-English output: Message of the Chief operating officer of the World Organization of the Health
SpanAm raw Spanish-English output: From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases. Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations.
Reverso raw Spanish-English output: From his{*its*} discovery, the antibiotics have transformed completely the perspective of the humanity with regard to the infectious diseases. Today the use of the antibiotics, cocktail with improvements in the reparation, the housing and the nutrition, together with the advent of the programs of widespread vaccination, they have given place to a notable decrease of infectious diseases that before were common and were annihilating entire populations.
SpanAm raw Spanish-English output: Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs.
Reverso raw Spanish-English output: Scourges that terrified million persons, as the pest, the savage cough, the poliomyelitis and the scarlatina, they have been controlled or are on the verge of be controlling. Now, in the dawn of a new millenium, the humanity faces with another crisis. Diseases before curable as the gonorrhea and the fever tifoidea they are becoming rapidly difficult to treat, whereas killer old men as the tuberculosis and the malaria are armed{*assembled*} now with the increasing impenetrable resistance the antimicrobial ones.
SpanAm raw Spanish-English output: This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care.
Reverso raw Spanish-English output: This phenomenon is potentially contenible. The problem is increasingly deep and complex, accelerated by the abuse of the antibiotics in the developed countries and the paradoxical subutilization of the antimicrobial ones of quality in the countries in development due to the poverty and the resultant shortage of an attention of effective health.
Examples: MT for revisors…
…with BV-aero/FE (2)
MT of spoken dialogs
• Specialized systems are already usable
e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
Much "noise" and "ungrammaticalities"
But specializing is very helpful!
• General systems are also possible
e.g. NEC/Xroad, Linguatec/Talk&Translate
Speech recognition is already good enough
Rough may be good enough (e.g. for chatting)
• Interpretation is different from translation…
…and participants are intelligent!
Similarity with access-oriented MT
French-Korean through IF (1)
French-Korean through IF (2)
French-Korean through IF (3)
A road map… to which goals?
• MT of adequate quality
• Not only for access
• For all languages
Four keys
• 2 on the technical side
Compromise: a far wider coverage, a somewhat smaller asymptotic quality
• Automatic learning techniques
• Using non-textual pivots (intermediate formal descriptors)
• 2 on the organizational side
Democratization, cooperation
• Cooperative development of open source linguistic resources on the Web
• Towards systems where quality can be improved "on demand" by users
Learning techniques
• Extend the use of hybrid techniques
symbolic, numerical, or mixed
==> they have demonstrated their potential at the research level
• stochastic grammars
• weighted (or "neural") dictionaries
• or build new tools, intrinsically numerical
inspiration from voice recognition
• 2 examples
learning analyzers: text -> semantic tree (IBM)
learning implicit very detailed DG from tree bank (NAIST)
Using non-textual pivots
• Semantico-pragmatic (ontological) pivots
task & domain oriented ==> limited applicability
• Abstract linguistic descriptors
the most precise, but often too sophisticated
depend on each language
• Anglo-semantic pivot: UNL
"the HTML of linguistic content"
• in UNL, a hypergraph represents the abstract structure of a (supposedly) equivalent English utterance
less precise but "robust"
symbols constructed from English ==> usable by all developers
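One way to see why a pivot matters for wide language coverage is simple counting. The sketch below is an illustration, not part of the original slides: it assumes one transfer system per ordered language pair in a pair-wise architecture, and one analyzer plus one generator per language in a pivot architecture. The function names are hypothetical; the language counts (11, 20, 180) are the ones quoted on the cooperative development slide, and 11 languages give the 110 pairs mentioned under "State of affairs".

```python
def transfer_systems(n_languages: int) -> int:
    """Pair-wise architecture: one system per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def pivot_modules(n_languages: int) -> int:
    """Pivot architecture (assumed): one analyzer into the pivot and one
    generator out of it for each language."""
    return 2 * n_languages


for n in (11, 20, 180):
    print(f"{n:>3} languages: {transfer_systems(n):>5} pair-wise systems "
          f"vs {pivot_modules(n):>3} pivot modules")
```

The pair-wise count grows quadratically while the pivot count grows linearly, which is the usual argument for accepting a somewhat less precise pivot.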
A simple UNL graph
• Ronaldo has headed the ball into the left corner of the goal
[Figure: UNL graph. Labels recovered from the slide:]
Nodes: score(icl>event,agt>human,fld>sport).@entry.@past.@complete, head(pof>body).@def, Ronaldo(icl>proper noun), goal(icl>abstract thing), left(aoj<thing), corner(icl>thing).@def, goal(icl>concrete thing)
Arc labels: agt, obj, ins, plt, mod, pos (twice)
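A graph like this one can be written down as a list of labelled binary arcs between universal words. The following is a minimal sketch, assuming the arc endpoints shown in the comments: the node strings and the seven relation labels come from the slide, but the drawing itself is lost, so which nodes each arc connects is a plausible reading rather than a recovered fact. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UNLGraph:
    """A UNL expression as a list of (relation, source UW, target UW) arcs."""
    arcs: list = field(default_factory=list)

    def add(self, relation: str, source: str, target: str) -> None:
        self.arcs.append((relation, source, target))

    def to_text(self) -> str:
        # One binary relation per line, written relation(source, target).
        return "\n".join(f"{r}({s}, {t})" for r, s, t in self.arcs)


# Node strings copied from the slide.
score = "score(icl>event,agt>human,fld>sport).@entry.@past.@complete"
head = "head(pof>body).@def"
ronaldo = "Ronaldo(icl>proper noun)"
goal_scored = "goal(icl>abstract thing)"
left = "left(aoj<thing)"
corner = "corner(icl>thing).@def"
goal_frame = "goal(icl>concrete thing)"

g = UNLGraph()
# ASSUMPTION: endpoints below are inferred from the sentence, not from the slide.
g.add("agt", score, ronaldo)      # who scores
g.add("obj", score, goal_scored)  # what is scored
g.add("ins", score, head)         # instrument
g.add("pos", head, ronaldo)       # whose head
g.add("plt", score, corner)       # place reached
g.add("mod", corner, left)        # which corner
g.add("pos", corner, goal_frame)  # corner of the goal

print(g.to_text())
```

All arcs here are binary; the scopes (hyper-nodes) that make UNL graphs hypergraphs are not needed for this sentence and are not modelled.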
Cooperative development
• of open source linguistic resources
• on the Web
Mutualization is necessary at least for lexical knowledge
too costly even for the leaders
size (#entries) has to grow for each language (300K, 3M?)
#languages has to increase dramatically (11 -> 20 -> 180?)
Integration of human- and machine-oriented knowledge is useful
e.g. to produce mixed MT/MAHT systems
A contribution: the Papillon project
• Goal: produce many open source dictionaries from a central lexical data base
• Means:
build rich (DiCo) monolingual dictionaries of lexies (senses)
interlink lexies by interlingual links (axies)
use XML & associated tools as basis to generate many formats
• for humans and for machines
start from (free) digital resources
induce "consumers" to become "producers" (contributors)
• Quality control:
private accounts
central validating/integrating group
Papillon database macrostructure
[Diagram. Components: Users, Dictionaries, Lexical Database, Resources, Human Contributors. Flows: Interaction with the Dictionaries, Extraction of Dictionaries, Integration of existing resources]
Interlingual links based on translations = "AXIEs"
Possibility to link 1 lexie with >1 acceptions
References to other semantic systems: AXIE (1) -> (n) UW
PAPILLON diagram
French DiCo: Vocable carte n.f.; Lexie carte.1 carte à jouer; Lexie carte.2 carte géographique
Japan. DiCo: 地図; カード
Engl. DiCo: Vocable card N; Lexie card.1 playing card; Lexie card.2 money card; Vocable=lexie map
Thai DiCo
Interlingual links:
Acception 343, UNL: card(icl>play),card(icl>thing)…
Acception 345, UNL: map(fld>geography)
Acception 1002, UNL: card(fld>money)
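The lexie/axie organization can be sketched as a small data structure: monolingual lexies on one side, interlingual acceptions (axies) on the other, each axie pointing to one or more UNL universal words. This is an illustration only. The lexies, acception numbers and UWs are taken from the diagram; the class names, the language codes and, above all, which lexie is attached to which acception are assumptions, since the connecting lines of the diagram are lost.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lexie:
    lang: str    # hypothetical language code
    ident: str   # e.g. "carte.1"
    gloss: str   # short definition, empty when the slide gives none


@dataclass
class Axie:
    number: int
    uws: list                                   # 1 axie -> n UNL universal words
    lexies: list = field(default_factory=list)  # lexies linked to this acception


def translations(axies, lexie, target_lang):
    """Lexies of target_lang that share at least one axie with `lexie`."""
    return [other
            for axie in axies if lexie in axie.lexies
            for other in axie.lexies if other.lang == target_lang]


carte1 = Lexie("fra", "carte.1", "carte à jouer")
carte2 = Lexie("fra", "carte.2", "carte géographique")
card1 = Lexie("eng", "card.1", "playing card")
card2 = Lexie("eng", "card.2", "money card")
map_en = Lexie("eng", "map", "")
chizu = Lexie("jpn", "地図", "")
kaado = Lexie("jpn", "カード", "")

# ASSUMPTION: the lexie-to-acception attachments below are inferred from meaning.
axies = [
    # The slide elides further UWs for acception 343 ("…"); only the two shown are listed.
    Axie(343, ["card(icl>play)", "card(icl>thing)"], [carte1, card1, kaado]),
    Axie(345, ["map(fld>geography)"], [carte2, map_en, chizu]),
    Axie(1002, ["card(fld>money)"], [card2]),
]

print([l.ident for l in translations(axies, carte2, "eng")])
print([l.ident for l in translations(axies, carte1, "jpn")])
```

Because translation goes through the axies rather than through bilingual pairs, adding a dictionary for one more language only requires linking its lexies to existing acceptions.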
Construct systems where quality can be improved "on demand" by users
• a priori through interactive disambiguation in the source language
• or a posteriori by correcting the pivot representation (UNL or other) through any language (as in MultiMeteo)
==> In both cases, all versions (in all languages) are improved
• possibility to merge
MT
multilingual generation
computer-aided authoring
Conclusion
• 4 keys to open the door to MT of adequate quality for all languages
• On the technical side,
dramatically increase the use of learning techniques
use pivot architectures, the most universally usable pivot being UNL
• On the organizational side,
cooperatively develop open source linguistic resources on the web
construct systems where quality can be improved "on demand" by users
• On the practical side, seek keys to unlock private investment, public funding, voluntary cooperation
could this conference become a decisive turning point?