
Towards OpenLogos Hybrid Machine Translation - Anabela Barreiro


OpenLogos open source machine translation - the ideal platform for a hybrid machine translation solution


Towards OpenLogos Hybrid Translation Anabela Barreiro

INESC-ID [email protected]

Introduction with Contextual Information

Research goals
– OpenLogos – 1st hybrid open source machine translation solution
– Hybridization of the OpenLogos system consists in embedding linguistic knowledge into statistical machine translation (SMT)

The timing is just right…
– Recognition by SMT researchers and developers of the need to integrate linguistic knowledge into machine translation (MT) systems
– Benefit from cloud computing, big data and advanced alignment techniques, which contribute to easier and faster development of new language pairs
– Use of crowdsourcing support to increase MT quality

The ideal platform for hybrid translation
– Logos legacy (one of the first RBMT systems, 1970)
– Logos Corporation – one of the longest-running commercial MT companies in the world (in business for over 30 years)
– The Logos MT product put its emphasis on semantic understanding
– The Logos approach used linguistic analysis of English to render it in a form that was “understood” by the computing system
– To a certain extent, the Logos approach is similar in spirit to the SMT approach, and complements SMT by providing answers that help overcome statistical weaknesses

The open source initiative
– OpenLogos is publicly available as open source software
– It has enthusiastic advocates and supporters in different parts of the world who believe that:
• OpenLogos will be used as the rule-based component of a new linguistically enhanced hybrid translation system
• The open source components of OpenLogos will help the NLP/CL research community make scientific advances

Presentation Outline
– Background on OpenLogos MT
– System pipeline architecture
– SAL representation language
– Classic problems with rule-driven systems
– How SAL benefits translation
– Advantages of the OpenLogos architecture
– Uniqueness of the OpenLogos MT system
– Exploiting OpenLogos resources for new applications
– Availability of OpenLogos free resources

Background to OpenLogos
Open source copy of the Logos system (1970-2001), adapted by DFKI
– Developed in the US, Germany and Italy
– 25-100 development staff for 30 years
– Investment of over 80 million US dollars
8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT, GE-EN, GE-FR, GE-IT
The commercial product was considered high quality
Industrial-strength MT, used successfully in 12 countries
Users included Ericsson of Sweden, the Canadian Secretary of State, SAP, Siemens-Nixdorf, Océ Netherlands, and Union Fenosa

OpenLogos Characteristics
Multi-target System
– One source language analysis can generate any number of targets
Pipeline Architecture
Language-neutral Software
– All linguistic knowledge is in data files, stored in a relational database
Semantico-Syntactic Abstraction Language (SAL Representation)
– Taxonomy-ontology
– NL sentences entering the system are immediately converted into SAL sentences
– SAL is the driving force of the OpenLogos process
Semantic Processing
– Semantic Table (= SEMTAB) containing thousands of transformation rules

OpenLogos Pipeline Architecture
[Figure: pipeline diagram running from Input to Output through the modules Format, RES1, RES2, P1-P4, T1-T4, GEN and Format; the modules draw on the SAL Rules, Target Rules and SEMTAB rulebases.]
• Highly Modular
• Incremental Processing
• Multi-Target System
• Bottom-up Analysis
• Deterministic Parse
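The pipeline properties listed above (incremental processing, one source analysis feeding several targets) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the stages are placeholders that pass the stream through unchanged, `run_analysis`, `translate` and `GENERATORS` are hypothetical names, and nothing here reflects the actual OpenLogos code.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the current representation of the sentence and returns a refined one.
Stage = Callable[[List[str]], List[str]]


def make_stage(name: str) -> Tuple[str, Stage]:
    """Build a placeholder stage that passes the stream through unchanged."""
    def stage(stream: List[str]) -> List[str]:
        return list(stream)  # a real module would rewrite the stream here
    return name, stage


# Module names taken from the slide; their behaviour is NOT modelled.
ANALYSIS_STAGES = [make_stage(n) for n in ("RES1", "RES2", "P1", "P2", "P3", "P4")]


def run_analysis(sentence: str) -> Tuple[List[str], List[str]]:
    """Run the source analysis once, stage by stage, and return (stream, trace)."""
    stream = sentence.split()
    trace = []
    for name, stage in ANALYSIS_STAGES:
        stream = stage(stream)
        trace.append(name)
    return stream, trace


def translate(sentence: str,
              generators: Dict[str, Callable[[List[str]], str]]) -> Dict[str, str]:
    """Analyse the source once, then hand the same analysis to every target generator."""
    stream, _ = run_analysis(sentence)
    return {lang: generate(stream) for lang, generate in generators.items()}


# Hypothetical generators that only tag the output with the target language.
GENERATORS = {
    "PT": lambda stream: "[PT] " + " ".join(stream),
    "FR": lambda stream: "[FR] " + " ".join(stream),
}

results = translate("a book on the presidency", GENERATORS)
```

Adding a target language in this sketch means adding one entry to `GENERATORS`; the analysis stages are untouched, which is the point of the multi-target design.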

Incremental Source Analysis - 1
[Figure: entry to the pipeline, showing Format, RES1 and RES2, fed by SAL Rules and SEMTAB]
Clause Segmentation and Homograph Resolution
– ways of cooking lentils - V
– types of [cooking utensils] - ADJ
Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)

Incremental Source Analysis - 2
[Figure: the Parse1, Parse2, Parse3 and Parse4 modules, fed by SAL Rules and Semtab. Labels in the figure: Simple NP; NP Prep NP; Relative clauses; Verb semantics; Complex NP; Simple clauses; Order in complex sentences; Semantic resolution (shown four times).]
E.g.: a book on the presidency
on = about; concerning
≠ a book on the table
on = over

SAL Representation Language
SAL - Semantico-syntactic Abstraction Language
SAL Taxonomy: 3 levels organized hierarchically
– Supersets / Sets / Subsets
Semantico-Syntactic continuum from NL word to Word Class
– Literal word: airport
– Head morph: port
– SAL Subset: Agfunc (agentive functional location)
– SAL Set: func (functional location)
– SAL Superset: PL (place)
– Word Class: N
Both the Pipeline Input Stream and the Rulebases are expressed in SAL
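The continuum from literal word to word class can be pictured as a small record with one field per level. The following Python sketch is only an illustration: the class `SalEntry` and its field names are hypothetical, and the values are the ones given on the slide for "airport".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SalEntry:
    """One word described at every level of the continuum (hypothetical layout)."""
    literal: str
    head_morph: str
    subset: str
    sal_set: str
    superset: str
    word_class: str

    def continuum(self) -> List[Tuple[str, str]]:
        """Return the levels from most specific to most abstract."""
        return [
            ("literal", self.literal),
            ("head morph", self.head_morph),
            ("subset", self.subset),
            ("set", self.sal_set),
            ("superset", self.superset),
            ("word class", self.word_class),
        ]

    def matches(self, level: str, value: str) -> bool:
        """True if this entry carries `value` at the named level."""
        return (level, value) in self.continuum()


# Values from the slide.
AIRPORT = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")
```

A rule written against `("superset", "PL")` would match this entry as readily as one written against the literal word, which is what lets rules be stated at whatever level of abstraction is needed.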


SAL Noun Supersets

E.g: two pieces of cake

NP parse must have:

- Plural morphology of pieces

- Semantics of cake

Developed:

- inductively

- by trial and error

- over a period of years

- by the development team

Abstract Noun Taxonomy
Abstract Noun Superset
– Non-verbal Abstract Set
• Non-verbal Subsets, e.g. Classifications
– Verbal Abstract Set
• Verbal Subsets, e.g. Methods / Procedures

Use of SAL Codes to Resolve Homographs
Is the word cooking a verb or an adjective?
– ways of cooking lentils: ways = N(AB/method), parser verb bias
– types of cooking utensils: types = N(AB/class), non-verb bias
SAL contributes to the resolution of the homograph
The SAL code N(AB/method) in the rule matches on a similar code in the SAL input stream. The effect of such a match is to resolve cooking as a verb.
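The resolution step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RES rulebase: the two SAL codes and their biases come from the slide, while the lookup tables, the function `resolve_homograph`, and the "nearest preceding coded noun" strategy are hypothetical simplifications.

```python
from typing import Dict, Optional

# SAL codes and biases as given on the slide.
SAL_CODE: Dict[str, str] = {"ways": "AB/method", "types": "AB/class"}
BIAS: Dict[str, str] = {"AB/method": "V", "AB/class": "ADJ"}


def resolve_homograph(phrase: str, homograph: str) -> Optional[str]:
    """Resolve `homograph` from the SAL code of the nearest preceding coded noun.

    Returns "V", "ADJ", or None when no coded noun precedes the homograph.
    """
    words = phrase.lower().split()
    if homograph not in words:
        return None
    position = words.index(homograph)
    # Walk backwards from the homograph to the closest noun with a known code.
    for word in reversed(words[:position]):
        code = SAL_CODE.get(word)
        if code is not None:
            return BIAS[code]
    return None
```

With this toy table, `resolve_homograph("ways of cooking lentils", "cooking")` yields `"V"` and `resolve_homograph("types of cooking utensils", "cooking")` yields `"ADJ"`, mirroring the two examples on the slide.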

What SAL Rules Look Like
Rules Have Five Components
SAL Pattern
– PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
Constraints
– Match only if conditions are true or false
Source Actions
– RES Rulebase: Resolves syntactic ambiguity
– PARSE Rulebase: Creates parse tree
– SEMTAB Rules: Effects semantic disambiguation
Target Action (optional)
– Effects syntactic and/or semantic transfer
Comment Line
– PARSE2 example: NP(info) Prep(“on”) NP → N1 “about” N2
E.g., book on political satire → book about ....
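The five components can be pictured as the fields of a single record. The Python sketch below is a simplification made for illustration only: `SalRule` and its field names are hypothetical, each pattern element is reduced to a (word class, code) pair instead of the two-slot notation on the slide, `u` is assumed to mean "unspecified" and is treated as a wildcard, and the source action text is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Element = Tuple[str, str]  # (word class, SAL code or literal)
WILDCARD = "u"             # assumption: "u" is read here as "unspecified"


@dataclass
class SalRule:
    pattern: List[Element]                        # 1. SAL pattern
    constraint: Callable[[List[Element]], bool]   # 2. constraints
    source_action: str                            # 3. source action
    target_action: Optional[str]                  # 4. target action (optional)
    comment: str                                  # 5. comment line

    def matches(self, stream: List[Element]) -> bool:
        """True if the stream fits the pattern and passes the constraint."""
        if len(stream) != len(self.pattern):
            return False
        for (p_class, p_code), (s_class, s_code) in zip(self.pattern, stream):
            if p_class != s_class:
                return False
            if p_code != WILDCARD and p_code != s_code:
                return False
        return self.constraint(stream)


# The PARSE2 example from the slide, in the simplified notation.
BOOK_ON = SalRule(
    pattern=[("N", "IN/data"), ("Prep", "on"), ("N", WILDCARD)],
    constraint=lambda stream: True,  # no extra condition in this sketch
    source_action="placeholder: build the NP Prep NP node",
    target_action='transfer "on" as "about"',
    comment='NP(info) Prep("on") NP: N1 "about" N2',
)
```

Because the pattern and the input stream use the same element notation, matching reduces to an element-by-element comparison, with the constraint as a final filter.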

Classic Problem of RBMT
Complexity
– Logic saturation
– Rulebase grows too large
– Performance degradation
– Difficult maintainability
– System improvability stasis
Ambiguity
– Quality/accuracy of output depends on effective disambiguation
– Effective disambiguation causes rulebase growth
Classic Dilemma of the Developer
– Reducing rulebase size to relieve complexity weakens disambiguation
– Increasing rulebase size to address ambiguities increases complexity

How OpenLogos Addresses Complexity and Ambiguity
Complexity
– Rules and input stream are expressed as SAL patterns
– Homogeneous ‘apples-to-apples’ matching
– Rules are SAL patterns stored/organized in an indexed pattern dictionary
– SAL input stream serves as search argument to SAL rulebase
– No limit on rule size and no impact on performance
– Rules are self-organizing
– Rulebase is easy to maintain
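The idea of using the input stream as a search argument against an indexed pattern dictionary can be sketched as follows. This Python example is an assumption-laden illustration, not the OpenLogos storage scheme: `PatternDictionary` is a hypothetical class, patterns are plain tuples of strings, and preferring the longest matching pattern is a design choice made here for the sketch.

```python
from typing import Dict, List, Optional, Tuple

Pattern = Tuple[str, ...]


class PatternDictionary:
    """Rules keyed by their pattern, looked up with slices of the input stream."""

    def __init__(self) -> None:
        self._rules: Dict[Pattern, str] = {}
        self._max_len = 0

    def add(self, pattern: Pattern, action: str) -> None:
        self._rules[pattern] = action
        self._max_len = max(self._max_len, len(pattern))

    def longest_match(self, stream: List[str],
                      start: int = 0) -> Optional[Tuple[Pattern, str]]:
        """Probe the dictionary with ever shorter slices starting at `start`."""
        longest = min(self._max_len, len(stream) - start)
        for length in range(longest, 0, -1):
            key = tuple(stream[start:start + length])
            action = self._rules.get(key)
            if action is not None:
                return key, action
        return None


rules = PatternDictionary()
rules.add(("N(IN/data)", "Prep(on)", "N"), "on = about")
rules.add(("Prep(on)",), "on = over")
```

In this sketch the number of dictionary probes per position is bounded by the length of the longest pattern, not by the number of rules stored, so adding rules does not slow the lookup down; rules that cannot apply to the current slice are never visited.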

How Rules Are Applied
Metaphor: biological neural net
– Vectors labeled V1-V6 = SAL input stream of the pipeline
– Cells in input vectors = SAL elements/words to which the NL input stream has been converted
– In this network, R1 through P4 = hidden layers containing SAL rules
– R1 represents RES1, P1 represents Parse1, and so on
– Each hidden layer contains between 2 and 4 thousand rules, organized by their SAL pattern, as in a dictionary
As the analysis progresses:
1- cells become fewer (abstract nature of the parse)
2- vectors become lighter (semantic disambiguation)

Chief similarity
– Efficient interaction between the SAL input stream and the rules of the hidden layers
– Only those rules which should be looked at are accessed
– The developer does not need to develop metarules or discrimination networks to achieve efficiency in rule matching
– Efficiency in rule matching is an automatic by-product of system design

How OpenLogos Addresses Complexity and Ambiguity
Ambiguity
– Syntactic Homograph Resolution
– Scoping of adjectives, prepositions
– Polysemy

Resolution of Polysemy in OpenLogos
SAL Representation Language in interaction with SEMTAB
SEMTAB provides a transfer that overrides the default dictionary transfer for the verb “raise”

NL String        SEMTAB Rule              Portuguese Transfer
raise a child    V(‘raise’) N(ANdes)      criar. . .
raise corn       V(‘raise’) N(MAedib)     cultivar. . .
raise the rent   V(‘raise’) N(MEabs)      aumentar. . .
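The override behaviour in the table can be sketched as a lookup keyed by the verb and the SAL code of its object. The Python below is an illustration only: the three SEMTAB entries are the ones in the table, but the default transfer "levantar" is an assumption chosen for the example (the slide does not state the default), and `transfer` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (verb, SAL code of the object) -> Portuguese transfer, from the table above.
SEMTAB: Dict[Tuple[str, str], str] = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

# Assumed default dictionary transfer; the source does not specify it.
DEFAULT_TRANSFER: Dict[str, str] = {"raise": "levantar"}


def transfer(verb: str, object_code: str) -> str:
    """Use the SEMTAB transfer when a rule matches, else the dictionary default."""
    return SEMTAB.get((verb, object_code), DEFAULT_TRANSFER[verb])
```

Here `transfer("raise", "MEabs")` gives `"aumentar"`, while an object code with no SEMTAB entry falls back to the assumed default.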

Deep Structure Rules of SEMTAB
A single deep-structure rule matches multiple surface structures and produces correct target transfers

English                    Portuguese                  Surface structure
he raised the rent         ele aumentou a renda        V+Object
the raising of the rent    o aumento da renda          Gerund
the rent, raised by …      a renda, aumentada por…     Part. ADJ
a rent raise               um aumento de renda         Noun

How SAL Benefits Translation
Examples showing voice transformations: EN passive voice >>> FR active voice
– The situation was alluded to by my friend in his letter
  Mon ami a fait allusion à la situation dans sa lettre
– The situation was alluded to in their letter
  On a fait allusion à la situation dans leur lettre
Voice transformations are possible due to:
• incremental pipeline approach
• strong semantic sensitivity

Advantages of OpenLogos Machine Translation Architecture
Creation of systems involving small or neglected/endangered languages
– not targeted by commercial programs
– to fulfil the goals of administrations and NGOs dealing with these languages, contributing to their promotion and/or revival
Freely available
– any user can access the technology
Customizable: institutions or businesses adopting an open-source MT system can customize it to their needs in many ways
– developing new linguistic data (vocabularies, rules, corpora)
– integrating system/data with other packages
– etc.

OpenLogos Uniqueness
Extensible dictionaries with underlying semantic foundation
Analyses whole source sentences, considering:
– Morphology
– Meaning (semantics)
– Grammatical structure and function
Semantico-Syntactic Abstraction Language (SAL)
– the parser is able to achieve better results than syntactic analysis alone would allow
Parsing is source-language specific only; generation is target-language specific
Originally a transfer approach, it evolved into the present system (which has inherent interlingual features)

OpenLogos comprehensive analysis makes it possible to construct a complete and idiomatically correct translation in the target language
OpenLogos is suitable for research and academic use
– make OpenLogos the standard MT platform for universities, education and other governmental institutions
– bring new life into a dormant technology (Phoenix rising metaphor)
OpenLogos linguistic data representation can be established as the foundation
– freely available for private and commercial use
– there is still need for the provision of linguistic and technical services and/or customer support on a fee basis
– packaging OpenLogos with the top five Linux distributions will generate a constant revenue stream
OpenLogos is an ideal platform for a hybrid MT solution

Contribution of OpenLogos Resources for New NLP Applications
SPIDER
– System for Paraphrasing In Document Editing and Revision
– Based on NooJ’s technology (http://www.nooj4nlp.net/)
– Publicly available at: http://www.linguateca.pt/ReEscreve/
– Designed to help with writing optimization, but its applicability extends to MT pre-editing
1st version – ReEscreve (for Portuguese) and ReWriter (for English)
2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
– Designed for integration in a cyber school project within the scope of an educational program to teach students how to improve their writing skills in the Portuguese language
EXPERT (prototype) – to assist writing of domain-specific texts
Initially, OpenLogos EN-PT dictionary data were adapted and enhanced with new properties (derivational, etc.) to create a new resource: Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/). ReEscreve uses Port4NooJ.

ParaMT
– Bilingual/multilingual paraphraser (translator prototype)
– Uses similar methodology to that employed by SPIDER
– Uses bilingual data
– Directly applicable to MT
Corpógrafo
– Multilingual corpora management tool
– Available at: http://www.linguateca.pt/corpografo/

Uses of SPIDER
– Authoring aid (word processing applications)
– Language composition tool
– Text production and style editor
– Empirical testbed for linguistic quality assurance
– Text (pre-)editing (machine translation)
– “Revision memory” tool (≈ “translation memory”)
– Applicable to general and technical language
When terminologies are integrated, it helps with writing in technical domains (e.g. student texts - ReWriter or legal texts - EXPERT)

ReEscreve: Suggestions for Text Rewriting
[Screenshot: paraphrases of support verb constructions (SVC) presented by ReEscreve’s paraphrasing system]


ReEscreve: a Rewritten Text

Text rewritten based on the user’s preferences

Users can suggest new expressions!

Suggestions for Text ReWriting
Suggestions for general language linguistic phenomena
– Compound adverbs > single adverbs
– Support verb constructions > single verbs
– Relatives > participial adjectives
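The kind of suggestion listed above can be illustrated with a toy paraphraser in Python. This is not how ReEscreve or SPIDER is implemented (the source says they build on NooJ grammars): the English example pairs, the `GRAMMARS` table, and the `suggest` and `rewrite` helpers are all assumptions introduced for this sketch, using plain regular expressions.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-grammars: phenomenon -> list of (regex, paraphrase).
GRAMMARS: Dict[str, List[Tuple[str, str]]] = {
    "support verb construction > single verb": [(r"\bmake a decision\b", "decide")],
    "compound adverb > single adverb": [(r"\bin a careful way\b", "carefully")],
}


def suggest(text: str, selected: List[str]) -> List[Tuple[str, str, str]]:
    """Return (matched text, suggested paraphrase, phenomenon) for each hit."""
    suggestions = []
    for name in selected:
        for pattern, replacement in GRAMMARS[name]:
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                suggestions.append((match.group(0), replacement, name))
    return suggestions


def rewrite(text: str, accepted: List[Tuple[str, str, str]]) -> str:
    """Apply only the suggestions the user accepted."""
    for original, replacement, _ in accepted:
        text = text.replace(original, replacement)
    return text


sample = "We must make a decision in a careful way."
hits = suggest(sample, list(GRAMMARS))
rewritten = rewrite(sample, hits)
```

Keeping suggestion and rewriting as separate steps mirrors the workflow in which the user chooses which paraphrases to accept before the text is changed.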

Selection of paraphrasing grammars for specific linguistic phenomena
Users can select among general and technical dictionaries (more than one selection allowed) and among grammars for specific linguistic transformations (one, several or all grammars can be selected).
The interface provides sample texts for testing.
[Screenshot callouts: Sample LEGAL text; Informative details about the linguistic resources selected]

Selection of a Domain Dictionary
Identification of legal terms in the text
Suggestions for the term “breach of law”
Users can select one term from the list of suggestions or provide a new suggestion

Suggestions provided and user’s capability to add new rewriting options
Text rewritten
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user
The user can suggest new words or expressions (synonyms or paraphrases)
It is possible to go back and change the user option as many times as necessary

ParaMT: a Paraphraser Applicable to MT
Recognition of Portuguese SVC and translation into English verbs
[Figure: MACHINE TRANSLATION grammar relating a PT support verb construction to EN verbs ($EN)]

Selected Publications on Paraphrasing Applications
Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision - Applicability in Machine Translation Pre-Editing". Computational Linguistics and Intelligent Text Processing, Proceedings of the 12th International Conference, Part II. Lecture Notes in Computer Science, vol. 6609 (2011), pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9.
Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), Aveiro, Portugal, 8-10 September 2008. Lecture Notes in Computer Science, vol. 5190, pp. 202-211. Springer Verlag.
Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets & Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa, Ontario, Canada, 29 August 2009), pp. 1-8.

OpenLogos for Indian Languages
Anusaaraka group at LTRC, IIIT-Hyderabad
– Integrating OpenLogos in their English-to-Hindi language accessor
– An OpenLogos-based English-Hindi MT prototype is already functional, but needs refinement before release
Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based Machine Translation System". In Proceedings of the 2010 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing, China, Aug 21-23, 2010.
Kalinga Institute of Industrial Technology (KIIT)
– Setting up a research lab with MT based on OpenLogos technology

Other Efforts with OpenLogos
Department of Political, Social and Communication Sciences, University of Salerno
– PhD dissertation in which the OpenLogos English-Italian SEMTAB rules methodology was applied, supported by the NooJ NLP environment, to represent the theoretical and methodological principles of the Lexicon-Grammar Theory
Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and using linguistic resources for multi-word unit processing in Machine Translation
Main Southern African universities
– Initial efforts to bring in OpenLogos as an MT platform for translation between English and the African languages (scarce resources, lack of parallel corpora, etc.), in an initiative similar to the one for Indian languages

OpenLogos Resources at DFKI
The Language Technology Lab of DFKI has adapted OpenLogos from the commercial Logos System
Also at Sourceforge under a GPL license
http://openlogos-mt.sourceforge.net/
OpenLogos employs only open source components:
– Use of open source development tools and compilers, such as GCC
– Replacement of non-open code and libraries
– Use of open source databases instead of a commercial database. All language-specific resources have been converted to PostgreSQL
– Use of open standards instead of vendor-specific protocols
– As a proof of concept for the software migration, Linux is used as the target platform for the first open source release of Logos

OpenLogos Components
Core code libraries of the server-side system and basic executables to start and run the system (APITest, logos_batch)
Resources, such as analysis (RES) and transfer (TRAN) grammars for source and target languages, and a multi-language dictionary database
Tools: LogosTermBuilder, user administration (LogosAdmin), command-line tools (APITest, openlogos), and a multi-user GUI for initiating and inspecting translation jobs and results (LogosTransCenter)

DFKI User Assistance with OpenLogos
DFKI hosts an open OpenLogos mailing list dedicated to discussion and exchange of information concerning OpenLogos developments and problems at:
http://www.dfki.de/mailman/listinfo/openlogos-list
LinkedIn Discussion Group on OpenLogos Machine Translation
OpenLogos Facebook page

Selected Publications
A few publications and technical papers are available, describing:
– the SAL representation language
– the system architecture and workflow
Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based Machine Translation: Philosophy, Model, Resources, and Customization. In Machine Translation, vol. 25, no. 2 (2011), pp. 107-126. Springer, Heidelberg. ISSN: 0922-6567. DOI: 10.1007/s10590-011-9091-z
Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language. In Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez & Francis M. Tyers (eds.), Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation. Alicante, Spain: Universidad de Alicante, Departamento de Lenguajes y Sistemas Informáticos, 2–3 November 2009, pp. 19–26.
Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18 (2003), pp. 1–72.
