
Enhancing Translation Systems with Bilingual Concordancing Functionalities

V. ANTONOPOULOS, C. MALAVAZOS, I. TRIANTAFYLLOU, S. PIPERIDIS

Presentation: V. Antonopoulos

[email protected]@ilsp.gr

Institute for Language and Speech Processing

Workshop on Balkan Language Resources & Tools

Current Framework

Increasing demand for multilinguality and for translation

Current translation systems still fail to completely meet the translation needs

Language transfer is still the prevailing problem

Need for further development of existing systems:

1. Integration of technologies (TM & MT)

2. Intelligent tools

Proposed Method

Expands the transfer selection capabilities

Utilizes sub-sentential information

Performs well when dealing with a limited amount of parallel data (Translation Memories)

Feasible for run-time applications

Statistically overcomes the translation unit (TU) identification barrier

Method Basics

Extracts sub-sentential bilingual correspondences

Statistical approach

Unique prerequisite: a parallel corpus

Automatic translation unit identification

Two-level iterative method:

Incrementally constructed translation

Continuously extended source segments

Employs target language correspondence information

Core Engine Description

[Diagram. Recoverable labels: Parallel Text Database holding source sentences SSent-1 ... SSent-N and target sentences TSent-1 ... TSent-N; FW Filtering; target words TW-1 ... TW-k; irrelevant word; CTW; SS; TS]

1st-Level Iterations

Incremental translation construction:

Employs the DICE coefficient as similarity measure

Adds one word from the CTW set in every new iteration

Stores translations above the threshold during an iteration

Terminates when no new translation is added

Selects the best translation based on similarity score and length
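The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It makes several assumptions: the Dice coefficient is computed over the sets of sentence ids in which the source phrase and a candidate occur, the CTW set is read as "all words of the target sentences aligned with a matching source sentence", candidates are unordered word sets, and the final choice prefers length first and score second. The toy corpus, the function names and the threshold value are invented for the example.

```python
def dice(source_ids: set, target_ids: set) -> float:
    """Dice coefficient between two sets of sentence ids."""
    if not source_ids or not target_ids:
        return 0.0
    return 2 * len(source_ids & target_ids) / (len(source_ids) + len(target_ids))


def occurrences(words, sentences):
    """Ids of the sentences that contain every word in `words`."""
    return {i for i, s in enumerate(sentences) if set(words) <= set(s.split())}


def build_translation(source_phrase, corpus, threshold):
    """Grow candidate translations by one CTW word per iteration."""
    src_sents = [s for s, _ in corpus]
    tgt_sents = [t for _, t in corpus]
    src_ids = occurrences(source_phrase.split(), src_sents)
    # Assumed CTW set: words of the target sentences aligned with the matches.
    ctw = sorted({w for i in src_ids for w in tgt_sents[i].split()})
    stored = {}               # frozenset of target words -> Dice score
    frontier = [frozenset()]  # candidates to extend in the current iteration
    while frontier:
        added = []
        for cand in frontier:
            for w in ctw:
                new = cand | {w}
                if w in cand or new in stored:
                    continue
                score = dice(src_ids, occurrences(new, tgt_sents))
                if score >= threshold:      # keep only translations above threshold
                    stored[new] = score
                    added.append(new)
        frontier = added      # stops when an iteration adds nothing new
    if not stored:
        return None, 0.0
    # Assumed selection rule: longest candidate first, then highest score.
    best = max(stored, key=lambda c: (len(c), stored[c]))
    return best, stored[best]


# Invented toy corpus of (source, target) sentence pairs.
corpus = [
    ("ηλεκτρονική αυτόματη μετάδοση", "electronic automatic transmission"),
    ("ηλεκτρονική αυτόματη μετάδοση EAT", "electronic automatic transmission EAT"),
    ("αυτόματη μετάδοση", "automatic transmission"),
    ("ηλεκτρονική μονάδα", "electronic unit"),
]
best, score = build_translation("ηλεκτρονική αυτόματη μετάδοση", corpus, threshold=0.7)
print(sorted(best), score)
```

With this toy data the word EAT stays below the threshold, so the loop stops after the three-word candidate and returns it as an unordered set.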

1st-Level Iterations Example

Source phrase: ηλεκτρονική αυτόματη μετάδοση

Iteration 1 candidates: electronic, automatic, transmission, ECU, refer, EAT

Iteration 2 candidates: electronic automatic, automatic transmission, Transmission EAT, EAT ECU

Iteration 3 candidates: electronic automatic transmission, refer electronic automatic, automatic transmission EAT

Translation Synthesis Example

Source phrase: ηλεκτρονική αυτόματη μετάδοση

Iteration 1 candidates: electronic, automatic, transmission, ECU, refer, EAT

Iteration 2 candidates: electronic automatic, automatic transmission, EAT ECU

Iteration 3 candidates: electronic automatic transmission, automatic transmission EAT

Selected translation: electronic automatic transmission

Selection criteria: a) length, b) score

2nd-Level Iterations

Aims of this 2nd-level process:

Improve accuracy of the translation outcome

Automatic translation unit identification

Efficient integration in a Translation Memory framework

2nd-Level Iterations

Employ the “Sequence Window Variety” technique:

Try to determine the best “cover” of an input text by examining the translation outcome of length-varying source segments

Initiate the procedure from the smallest segments (1-word segments)

Continuously extend the input source segments

Shift the observation window from left to right over the source segments

Store acceptable translations along with their scores during every iteration

Combinatorial process for computing the optimal set of candidate source units that provides the best “cover”
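A rough Python sketch of the windowing and cover computation follows. The slides do not define "best cover", so the objective used here is an assumption: cover as many source words as possible, then prefer fewer (longer) units, then a higher total score. The `translate` callback, the `memory` dictionary and its scores are hypothetical stand-ins for the 1st-level process.

```python
def windows(tokens, max_len=None):
    """Yield (start, end) spans, shortest first, sliding left to right."""
    n = len(tokens)
    max_len = max_len or n
    for length in range(1, max_len + 1):
        for start in range(0, n - length + 1):
            yield start, start + length


def best_cover(tokens, translate):
    """Choose non-overlapping source units that best cover `tokens`.

    `translate(segment)` is a hypothetical callback returning
    (translation, score) for an acceptable segment, or None otherwise.
    """
    n = len(tokens)
    accepted = {}
    for start, end in windows(tokens):
        result = translate(tokens[start:end])
        if result is not None:
            accepted[(start, end)] = result
    # best[i] = (covered words, total score, chosen spans) for tokens[:i]
    best = [(0, 0.0, [])]
    for i in range(1, n + 1):
        options = [best[i - 1]]  # option: leave token i-1 uncovered
        for (start, end), (_, score) in accepted.items():
            if end == i:
                covered, total, spans = best[start]
                options.append((covered + end - start,
                                total + score,
                                spans + [(start, end)]))
        # Assumed objective: coverage, then fewer units, then total score.
        best.append(max(options, key=lambda o: (o[0], -len(o[2]), o[1])))
    return [(tokens[s:e], accepted[(s, e)][0]) for s, e in best[n][2]]


# Invented acceptable translations for a few segments of "A B C D E F G H".
memory = {
    ("A", "B"): ("a-b", 0.8),
    ("D",): ("d", 0.6),
    ("E",): ("e", 0.6),
    ("D", "E", "F"): ("d-e-f", 0.9),
}
tokens = "A B C D E F G H".split()
cover = best_cover(tokens, lambda seg: memory.get(tuple(seg)))
print(cover)
```

In this toy run the three-word unit D E F wins over the separate units D and E because it covers more words with fewer units.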

2nd-Level Iterations Example

Iteration    Source Sentence    Input Phrase
0            A B C D E F G H
0-a          A B C D E F G H    D
0-b          A B C D E F G H    E
0-c          A B C D E F G H    D E
1-a          A B C D E F G H    C D E
1-b          A B C D E F G H    D E F
2-a          A B C D E F G H    B C D E
2-b          A B C D E F G H    C D E F
2-c          A B C D E F G H    D E F G

Translation Synthesis Example (1)

Source phrase: ηλεκτρονική αυτόματη μετάδοση

[Figure: candidate translations of the source segments, including electronic, permission, traction, ETC, force, EAT, EAT ECU, electronic automatic, automatic transmission]

Synthesized translation: electronic & automatic transmission

Selection criteria: a) length, b) score

Translation Synthesis Example (2)

Source phrase: ασφαλειοθήκη χώρου επιβατών

[Figure: candidate translations of the source segments, including fuse, box, switch, passenger, compartment, relay, ignition, fuse box, fuse passenger, passenger compartment, compartment fuse box]

Synthesized translation: fuse box & passenger compartment

Selection criteria: a) length, b) score

Significant Technical Aspects

N-gram based conflation method for enhancing the existing statistical evidence (overcomes the limitations that morphologically rich languages introduce)

Variable cut-off threshold (avoids rejecting translation parts at an early stage of the algorithm)

Specific word order not taken into account (enhances statistical evidence in small bilingual corpora)

Contiguity requirement (ensures translation accuracy)
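The last two points can be read together: a candidate is accepted only if its words sit next to each other in a target sentence, in whatever order. The Python check below is one possible reading, offered as a sketch only. The slides do not say how contiguity is actually tested, and the function name and sample sentence are invented.

```python
def occurs_contiguously(candidate, sentence):
    """True if the candidate words fill one contiguous window of
    `sentence`, regardless of their order inside that window."""
    words = sentence.split()
    wanted = sorted(candidate)
    size = len(wanted)
    return any(sorted(words[i:i + size]) == wanted
               for i in range(len(words) - size + 1))


sentence = "check the automatic transmission fluid"
print(occurs_contiguously({"transmission", "automatic"}, sentence))  # adjacent words
print(occurs_contiguously({"automatic", "fluid"}, sentence))         # separated words
```

Comparing sorted windows ignores word order while still rejecting candidates whose words are scattered across the sentence.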

Evaluation

Evaluation set: 350 input text fragments (80% noun phrases, 20% verb phrases) manually extracted from an automotive bilingual parallel corpus (3,100 EN words, 4,300 EL words)

                Static Window    Flexible Window
Correct         75%              83%
Second Match    8%               6%
Errors          17%              11%

Future Work

Apply to comparable bilingual corpora

Exploit linguistic information when available

Explore ways of integrating into a Machine Translation & Translation Memory framework

Integration in MT & TM Framework

[Diagram. Recoverable labels: TM; Statistical Processing; Machine Translation; source sentence A B C D E F G H; Part 1; D E F; Part 2; Part 3; Target Sentence]

Why DICE

Although the constituent words may have multiple senses, the identified TUs appear to have a unique translation

“current”:

a) present, existing

b) electricity (alternating ~)

“current flows across”:

a) ρεύμα περνά (1 meaning)

Better measure of similarity than MI and specific MI (log-likelihood ratio): 1-1 and 1-0 matches are significant, 0-0 matches are not

Good measures of independence are not necessarily good measures of similarity…

In practice, DICE works better!
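The point about 0-0 matches can be illustrated with a small Python comparison. The sketch uses the standard 2x2 co-occurrence table, where `a` counts sentence pairs containing both items (1-1), `b` and `c` count pairs containing only one of them (1-0 and 0-1), and `d` counts pairs containing neither (0-0). The "specific MI" of the slide is taken here to mean pointwise mutual information, which is an assumption, and the counts are invented.

```python
import math


def dice(a, b, c):
    """Dice coefficient from co-occurrence counts; the 0-0 cell plays no role."""
    return 2 * a / (2 * a + b + c)


def pmi(a, b, c, d):
    """Pointwise mutual information (log base 2) of the 1-1 event."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


# Same 1-1 and 1-0 evidence, observed in a small and in a large corpus.
small = dict(a=10, b=5, c=5, d=80)
large = dict(a=10, b=5, c=5, d=980)

print(dice(small["a"], small["b"], small["c"]), pmi(**small))
print(dice(large["a"], large["b"], large["c"]), pmi(**large))
```

The Dice value is identical in both cases, while the pointwise MI grows simply because more sentence pairs contain neither item.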

Corpus Size

Automotive industry bilingual corpus (EN-EL)

6,200 sentences in each language

3,100 EN words – 4,300 EL words

Champollion Approach

Tested on 2 different parts of the Hansard corpus (Canadian Parliament): 3.5 million & 8.5 million words

65% - 75% accuracy was reported for the 3 evaluation sets

Proposed increasing the database corpus for better results

Conflation Method

N-gram method

Soft clustering of words

>98% accuracy (evaluated using the first 1000 entries of the ILSP morphological lexicon)

Works well even with small words

Most significant factor was the performance, so emphasis was placed on recall
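One common n-gram conflation technique compares the character bigrams of two word forms with the Dice coefficient and groups forms whose similarity exceeds a threshold. The Python sketch below shows that generic technique. The slides do not state which n-gram variant, n-gram size or threshold the authors used, so the function names and the 0.6 threshold are assumptions.

```python
def bigrams(word):
    """Set of distinct character bigrams of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def ngram_similarity(w1, w2):
    """Dice coefficient over the bigram sets of two words."""
    b1, b2 = bigrams(w1), bigrams(w2)
    if not b1 or not b2:
        return 0.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))


def conflation_class(word, vocabulary, threshold=0.6):
    """Soft cluster: all vocabulary words similar enough to `word`."""
    return [v for v in vocabulary if ngram_similarity(word, v) >= threshold]


vocabulary = ["μετάδοσης", "μετάδοση", "αυτόματη"]
print(ngram_similarity("μετάδοση", "μετάδοσης"))
print(conflation_class("μετάδοση", vocabulary))
```

Because a word may pass the threshold for several different query words, the resulting groups can overlap, which matches the "soft clustering" description. Lowering the threshold trades precision for recall.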

Conflation Methods

[Diagram: taxonomy of conflation methods. Nodes: Interactive, Automatic, Suffix removal, Statistical, Table-based, N-grams, Longest Match, Simple Removal]