RANLP 2015
Natural Language Processing for Translation Memories (NLP4TM)
Proceedings of the Workshop
September 11, 2015, Hissar, Bulgaria
The Workshop on Natural Language Processing
for Translation Memories (NLP4TM), associated with THE INTERNATIONAL CONFERENCE
RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2015
PROCEEDINGS
Hissar, Bulgaria, 11 September 2015
Introduction
Translation Memories (TM) are amongst the most used tools by professional translators. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations by being able to retrieve how a similar sentence was translated before. Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories has focused on improving the user experience by allowing processing of a variety of document formats, intuitive user interfaces, and so on.
The term second generation translation memories has been around for more than ten years, and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve the matching of subsentential chunks, edit distance operations between syntactic trees, and the incorporation of semantic and discourse information in the matching process. Terminologies, glossaries and ontologies are also very useful for translation memories, facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction which can be integrated in the translation process. The building of translation memories from corpora is another field where methods from NLP can contribute to improving the translation process.
We are happy we could include in the workshop programme 4 long contributions and 3 short papers dealing with the aforementioned issues.
Vít Baisa, Aleš Horák and Marek Medveď discuss in Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods how it is possible to extend existing translation memories using linguistically motivated segment combination approaches, concentrating on preserving high translational quality.
Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories Eduard Barbu presents a method for identifying false translations in translation memories, framed as a classification task.
In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodorou explores the use of paraphrasing in retrieving better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.
CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.
Carla Parra Escartín aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators in Creation of new TM segments: Fulfilling translators' wishes. She presents a pilot study where the requests of translators are being implemented in the translation workflow.
Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper Improving Translation Memory Matching through Clause Splitting shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible
without the support received from the EXPERT project (FP7/2007-2013, under REA grant agreement no. 317471, http://expert-itn.eu).
Constantin Orasan and Rohit Gupta
Organizers
Constantin Orasan, University of Wolverhampton, UK
Rohit Gupta, University of Wolverhampton, UK
Program Committee
Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk
Invited Speaker
Marcello Federico, Fondazione Bruno Kessler, Italy
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín . . . . . 1

Spotting false translation segments in translation memories
Eduard Barbu . . . . . 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov . . . . . 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou . . . . . 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď . . . . . 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith . . . . . 36
Conference Program
9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3º I, 28230 Las Rozas, Madrid, Spain
carlaparra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for, and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has
a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate
3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of such functionality: fragment assembly.6
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6 For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM:
(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.
Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.
Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
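The recombination just described can be sketched as a simple lookup over colon-delimited halves stored from Examples 1 and 2. The dictionary and function names below are ours, for illustration only; they are not the paper's implementation.

```python
# Hypothetical fragment TM built from Examples 1 and 2 above: each
# colon-delimited half of a past segment is stored with its translation.
fragment_tm = {
    "The following message then appears": "Aparece el siguiente mensaje",
    "Click accept to run the program": "Haga click en aceptar para ejecutar el programa",
    "The window will show the following": "La ventana mostrará lo siguiente",
    "The application will close": "Se cerrará la aplicación",
}

def translate_via_fragments(sentence: str):
    """Split a new sentence at the colon and look up each half.

    Returns the assembled Spanish translation, or None when the sentence
    has no colon or either half is missing from the fragment TM.
    """
    before, sep, after = sentence.partition(": ")
    if not sep:
        return None
    first = fragment_tm.get(before)
    second = fragment_tm.get(after.rstrip("."))
    if first is None or second is None:
        return None
    return f"{first}: {second}."

# The sentence in Example 3 is assembled from halves of Examples 1 and 2:
print(translate_via_fragments(
    "The following message then appears: The application will close."
))
# → Aparece el siguiente mensaje: Se cerrará la aplicación.
```

In practice a real fragment TM would be populated automatically, as the paper describes in Section 4, rather than written by hand.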
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originated from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parenthesis
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f.
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f.
If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons, in which the fragment before or after the colon appears:

(7) Installing new services
In these cases we proceeded as follows:

1. Check that there is an ellipsis/a colon on both the source and the target segment.
2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.
3. Create a new TM segment for each fragment.
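The steps above can be sketched roughly as follows; the function name and regular expression are ours, and the real implementation is not described at this level of detail in the paper.

```python
import re

# Matches an ellipsis ("..." or the single-character "…") or a colon.
SEPARATOR = re.compile(r"\.\.\.|…|:")

def split_segment(source: str, target: str):
    """Split a bilingual segment pair at the first ellipsis/colon.

    Following the steps above: only split when the separator occurs on
    BOTH sides, and emit one new TM segment per aligned fragment pair.
    """
    src_parts = SEPARATOR.split(source, maxsplit=1)
    tgt_parts = SEPARATOR.split(target, maxsplit=1)
    if len(src_parts) < 2 or len(tgt_parts) < 2:
        return []  # separator missing on one side: no new segments
    return [
        (src_parts[0].strip(), tgt_parts[0].strip()),
        (src_parts[1].strip(), tgt_parts[1].strip()),
    ]

# Example 5 yields two new TM segments:
new_segments = split_segment(
    "Installing new services... Service XXXX for premium clients [2]",
    "Instalando nuevos servicios... Servicio XXXX para clientes premium [2]",
)
# new_segments[0] == ("Installing new services", "Instalando nuevos servicios")
```

Note that this assumes the first fragment of the source aligns with the first fragment of the target, which holds for the English–Spanish texts the paper works with.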
4.2 Parenthesis, square brackets and curly braces

Sometimes a sentence includes a fragment between parenthesis, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parenthesis. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters, and the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier).
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes).

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parenthesis, square brackets or curly braces on both the source and the target segment.
2. Keep three fragments out of each sentence:
   (a) the sentence removing those characters and the content between them;
   (b) a fragment starting at the opening character and finishing on the closing one, including the content within (in this fragment the parenthesis, square brackets or curly braces are maintained);
   (c) a fragment containing only the content within those characters (parenthesis, square brackets or curly braces), but without them.
3. Create a new TM segment for each fragment.
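The three fragments of step 2 can be sketched as below for a single (monolingual) sentence; the function name and regular expressions are ours, for illustration, and the paper applies the same extraction to both sides of the segment pair.

```python
import re

# One pattern per bracket type; each captures the content between the brackets.
BRACKETS = {
    "paren": re.compile(r"\(([^()]*)\)"),
    "square": re.compile(r"\[([^\[\]]*)\]"),
    "curly": re.compile(r"\{([^{}]*)\}"),
}

def bracket_fragments(sentence: str, kind: str = "paren"):
    """Extract the three fragments described in step 2:
    (a) the sentence without the bracketed part,
    (b) the bracketed fragment, brackets included,
    (c) the content alone, brackets removed.
    Returns None when the sentence has no such fragment."""
    match = BRACKETS[kind].search(sentence)
    if match is None:
        return None
    stripped = (sentence[:match.start()] + sentence[match.end():]).strip()
    stripped = re.sub(r"\s{2,}", " ", stripped)  # collapse doubled spaces
    return stripped, match.group(0), match.group(1)

frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)"
)
# frags == ("Creates an installation package for application installation",
#           "(if it was not created earlier)",
#           "if it was not created earlier")
```

A production version would handle several bracketed clauses per sentence and pair source fragments with target fragments, as discussed in the next paragraph.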
At this preliminary stage, we considered that when a sentence has several clauses in parenthesis, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments:
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
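A minimal sketch of this handling, assuming numbered inline tags rendered as <1>...</1> and straight double quotes (the function name and regexes are ours, not the paper's):

```python
import re

TAG = re.compile(r"<(\d+)>(.*?)</\1>")  # paired numbered tags, e.g. <1>...</1>
QUOTE = re.compile(r'"([^"]*)"')        # straight double quotes

def untag_segment(sentence: str):
    """Return (sentence with tags/quotes removed, list of inner texts).

    Both outputs become new TM segments: the cleaned full sentence,
    plus one segment per quoted or tagged fragment.
    """
    inner = [m.group(2) for m in TAG.finditer(sentence)]
    inner += [m.group(1) for m in QUOTE.finditer(sentence)]
    cleaned = TAG.sub(r"\2", sentence)   # drop the tags, keep their content
    cleaned = QUOTE.sub(r"\1", cleaned)  # drop the quotes, keep their content
    return cleaned, inner

cleaned, inner = untag_segment(
    "If you clear the <1>Inherit settings from parent policy</1> check box"
)
# cleaned == "If you clear the Inherit settings from parent policy check box"
# inner == ["Inherit settings from parent policy"]
```

Real CAT-tool exports encode tags in tool-specific ways, so the tag pattern would need adapting to the actual export format.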
5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool used for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ, using the project TM
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
              Segments   Words (EN)   Words (ES)     Total
Project TM       16842       212472       244159    456631
Product TM       20923       274542       317797    592339
Client TM       256099      3427861      3951732   7379593

Table 2: Size of the different TMs used
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM
                   Segments   Words (EN)   Words (ES)     Total
Project TM (new)       6776        56973        66297    123270
Product TM (new)       7760        71034        83125    154159
Client TM (new)       74041       662714       769705   1432419

Table 3: Size of the new TMs generated using our approach
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parenthesis was more prolific.
                 Project TM   Product TM   Client TM
Ellipsis                  7            6          50
Colon                  1801         2094       17894
Parenthesis            1361         1621       29637
Square brackets          78           73         598
Curly braces              0            0           0
Quotations             1085         1146        8797
Tags                   2523         2892       17792

Table 4: Number of newly generated segments per type and TM used
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.
2. TM2: Only the new segments generated from the project TM provided by the client.
3. TM3: A combination of the project TM and the new segments retrieved from it.
4. TM4: Only the new segments generated from the product TM.
5. TM5: The project TM and the new segments generated from the product TM.
6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.
7. TM7: Only the new segments generated from the client TM.
8. TM8: The project TM combined with the new segments generated from the client TM.
9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.
10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.
11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM
             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started  234   204   150   202   143   143    38    27    26     26     26
Pre-trans    155   111   178   108   179   183   139   199   205    201    205
Fragments     36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started
           TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%         0     5     5     5     5     5     5     5     5      5      5
95-99%       2     3     3     4     4     4     9     9     9      9      9
85-94%       0     1     1     1     1     1     2     2     2      2      2
75-84%      14     9    14     9    15    15    12    16    16     16     16
50-74%     187   130   198   136   202   202   180   223   226    226    226
No match   142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band
generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together, and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination ofalready existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for the automatic cleaning of translation memories.
1 Introduction
MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained with the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1 The distribution of automatically ac-quired memories in MyMemory
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization, from a technical point of view, is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfil that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench². Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in that approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples because the vast majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length³. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
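In code, the score in equation 1 can be computed as follows (segment lengths in characters; a minimal sketch rather than the authors' implementation):

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011).

    Scores far from zero indicate a suspicious length ratio between
    the source and destination segments.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# The partial translation from section 3 yields a positive score:
print(round(church_gale_score("Early 1980s Muirfield CC", "Primi anni 1980"), 2))  # → 0.78
```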
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat⁴. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set. It contains 1243 bi-segments and has 373 negative examples.
• Test Set. It contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.
• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.
• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.
• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists it is present/not present in the source and the target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
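Two of the features above can be illustrated with a short sketch. The punctuation inventory below and the tuple representation of language codes are simplified placeholders of our own (the paper uses the Cybozu detector to obtain the detected codes):

```python
import math
from collections import Counter

PUNCT = set(".,;:!?()\"'")  # illustrative punctuation inventory

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def punctuation_similarity(src: str, tgt: str) -> float:
    """Cosine similarity of the punctuation count vectors of a bi-segment."""
    return cosine(Counter(c for c in src if c in PUNCT),
                  Counter(c for c in tgt if c in PUNCT))

def lang_dif(declared: tuple, detected: tuple) -> int:
    """Number of mismatches between declared and detected language
    codes (0, 1 or 2), before normalisation to [0, 1]."""
    return sum(1 for dec, det in zip(declared, detected) if dec != det)

print(punctuation_similarity("How are you?", "Come stai?"))  # → 1.0 (both end in '?')
print(lang_dif(("en", "it"), ("en", "fr")))                  # → 1
```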
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segments is done with the highly accurate language detector Cybozu⁵.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
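This set-up can be sketched with present-day scikit-learn as follows. The feature matrix here is synthetic random data standing in for the 17 features of section 4.1, with the class balance mimicking the roughly 30% negative examples of the training set; it is an illustration of the evaluation protocol, not the authors' code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(1243, 17)                    # placeholder for the normalized features
y = (rng.rand(1243) > 0.3).astype(int)    # ~30% negatives, as in the training set

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (linear)": SVC(kernel="linear"),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=3)          # three-fold stratified evaluation
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```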
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all the algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased even if the recall drops. We make some suggestions in the next section.
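The arithmetic behind this conclusion can be checked directly from the confusion counts quoted above (positive = true translation). This is a quick sanity check, not part of the authors' code; the values come out close to, though not identical with, the rounded figures in table 2:

```python
# SVM confusion counts on the 309-example test set (section 5)
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
lost = fn / (tp + fp + fn + tn)   # good bi-segments a cleaner would discard

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"share of all examples lost as false negatives: {lost:.0%}")  # → 10%
```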
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
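This ranking idea could be prototyped as follows. The data here is synthetic and `threshold_for_precision` is our own illustrative helper, not a function from the paper: the predicted probabilities are sorted and the lowest threshold reaching the target precision on held-out data is kept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)   # synthetic labels

clf = LogisticRegression().fit(X[:700], y[:700])
probs = clf.predict_proba(X[700:])[:, 1]   # P(bi-segment is a true translation)

def threshold_for_precision(probs, labels, target=0.9):
    """Smallest probability threshold whose surviving predictions
    reach the target precision on held-out data."""
    for t in np.sort(np.unique(probs)):
        pred = probs >= t
        if pred.any() and labels[pred].mean() >= target:
            return t
    return 1.0

t = threshold_for_precision(probs, y[700:])
kept = probs >= t
print(f"threshold={t:.2f}: kept {kept.sum()} of {len(probs)} segments "
      f"at precision {y[700:][kept].mean():.2f}")
```

Segments below the threshold would simply be left untouched rather than deleted, which trades recall for the higher precision the cleaning task requires.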
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.
Abstract
We propose the integration of clause
splitting as a pre-processing step for match
retrieval in Translation Memory (TM)
systems to increase the number of relevant
sub-segment matches. Through a series of
experiments we investigate the impact of
clause splitting in instances where the input
does not match an entire segment in the
TM, but only a clause from a segment. Our
results show that there is a statistically
significant increase in the number of
retrieved matches when both the input
segments and the segments in the TM are
first processed with a clause splitter
1 Rationale
Translation memory tools have had a great
impact on the translation industry as they
provide considerable assistance to translators.
They allow translators to easily re-use
previous translations providing them with
valuable productivity gains in an industry
where there is a great demand for quality
translation delivered in the shortest possible
time. However, existing tools have some
shortcomings The majority of existing tools
rely on Levenshtein distance and seek to
identify matches only at the sentence level
Semantically similar segments are therefore
difficult to retrieve if the string similarity is not
high enough as are sub-segment matches
because if only part of a sentence matches the
input even if this part is an entire clause it is
unlikely that this sentence would be retrieved
(Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text
types such as updated versions of technical
manuals.
In this study we aim to address the
problem of retrieving sub-segment matches by
performing clause splitting on the source
segment as a pre-processing step for TM match
retrieval While matches for entire sentences
or almost entire sentences are the most useful
type of matches it is also less likely for such
matches to be found in most text types and
even less so for complex sentences Retrieving
clauses is desirable because there is a higher
chance for a match to be found for a clause than
for a complex sentence and at the same time
clauses are similar to sentences in that they
both contain a subject and a verb hence a
"complete thought"; therefore clause matches
are more likely to be in context and to actually
be used by the translator than phrase matches
for example.
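As a toy illustration of the idea of treating clauses as retrieval units, clause boundaries could be approximated by punctuation followed by a conjunction or relative pronoun. The regex and the conjunction list below are our own illustrative choices, not the clause splitter actually used in this study:

```python
import re

# Hypothetical minimal splitter: break at commas/semicolons that are
# followed by a coordinating conjunction or a relative pronoun.
CLAUSE_BOUNDARY = re.compile(
    r"[,;]\s+(?=(?:and|but|or|so|because|which|who|when|while)\b)", re.I)

def split_clauses(sentence: str) -> list:
    """Return candidate clauses of a segment for separate TM lookup."""
    return [c.strip(" ,;") for c in CLAUSE_BOUNDARY.split(sentence)]

print(split_clauses(
    "The file is saved automatically, but you can disable this option."))
# → ['The file is saved automatically', 'but you can disable this option.']
```

Each returned clause would then be matched against the TM independently, so that a clause-level match survives even when the full sentence falls below the fuzzy threshold.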
We perform experiments comparing the
match retrieval performance of TM tools when
they are used as is and when the input file and
the TM segments are first processed with a
clause splitter before being fed into the TM
tool. The paper is organised as follows. In
section 2 we discuss related work on TM
matching In section 3 we discuss how clause
splitting can be beneficial to TM matching and
how clause splitting was implemented for this
study We then describe our experiments in
section 4 and discuss the results in section 5
Finally we present our conclusion and future
work in section 6
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas so as to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs, like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
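As an illustration of how such a multi-level comparison might work, the sketch below scores token pairs at the surface, lemma and word-class levels against a tiny hand-made lexicon. The weights and the lexicon entries are our own invented assumptions for illustration, not MetaMorphoTM's actual values.

```python
# Illustrative multi-level token similarity in the spirit of MetaMorphoTM:
# full credit for identical surface forms, partial credit when only the
# lemma or only the word class agrees. Lexicon and weights are invented.
LEXICON = {  # token -> (lemma, word class)
    "soldiers": ("soldier", "N"), "soldier": ("soldier", "N"),
    "missing":  ("miss", "V"),    "missed":  ("miss", "V"),
    "war":      ("war", "N"),     "wars":    ("war", "N"),
}

def token_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0                       # surface-form match
    la, lb = LEXICON.get(a), LEXICON.get(b)
    if la and lb and la[0] == lb[0]:
        return 0.8                       # same lemma, different form
    if la and lb and la[1] == lb[1]:
        return 0.3                       # same word class only
    return 0.0

print(token_sim("soldiers", "soldiers"))  # 1.0
print(token_sim("soldiers", "soldier"))   # 0.8
print(token_sim("soldiers", "war"))       # 0.3
```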
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, the ultimate goal of which is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
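A minimal sketch of this effect, using difflib's character-level similarity as a rough stand-in for the undisclosed Levenshtein-based scores of commercial TM tools; the segments are taken from the Table 1 example:

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in percent (a rough stand-in for the
    Levenshtein-based scores used by commercial TM tools)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

tm_segment = ("the ministry of defense once indicated that about 20000 "
              "soldiers were missing in the korean war and that the ministry "
              "of defense believes there may still be some survivors")
tm_clauses = [
    "the ministry of defense once indicated",
    "that about 20000 soldiers were missing in the korean war",
    "and that the ministry of defense believes",
    "there may still be some survivors",
]
query = "that about 20000 soldiers were missing in the korean war"

whole = fuzzy_score(query, tm_segment)                 # clause vs. full sentence
best_clause = max(fuzzy_score(query, c) for c in tm_clauses)

print(f"vs. whole segment: {whole:.0f}%")       # well below a 70-75% threshold
print(f"vs. best clause:   {best_clause:.0f}%")
```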
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
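The toy sketch below illustrates only the finite-verb criterion behind this definition; the actual splitter of Puscasu (2004) relies on full morphological analysis and a much richer rule set, whereas here the finite verb forms and the boundary cues are hand-listed for a single example:

```python
import re

# Toy stand-in for a rule-based finite-clause splitter. The finite verb
# list and boundary cues below are hand-made for one example sentence.
FINITE = {"indicated", "were", "believes", "may", "was", "is"}

def has_finite_verb(tokens):
    return any(t in FINITE for t in tokens)

def split_clauses(sentence):
    """Split at candidate clause boundaries, emitting a clause only once the
    accumulated chunk contains a finite verb (per the definition above)."""
    # illustrative boundary cues: split before 'that' / 'and that' / 'there'
    parts = re.split(r"\s+(?=(?:and\s+)?that\b|there\b)", sentence.lower())
    clauses, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip() if buffer else part
        if has_finite_verb(buffer.split()):
            clauses.append(buffer)
            buffer = ""
    if buffer:  # a trailing verbless chunk attaches to the last clause
        clauses[-1] = f"{clauses[-1]} {buffer}"
    return clauses

print(split_clauses(
    "the ministry of defense once indicated that about 20000 soldiers "
    "were missing in the korean war and that the ministry of defense "
    "believes there may still be some survivors"))
```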
In this study, the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
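For concreteness, the sketch below implements word-level Levenshtein distance and one common fuzzy-match formulation, 1 - distance / length of the longer segment (commercial tools do not disclose their exact scoring), applied to the w1...w20 example from section 3, where a five-word clause matches a twenty-word TM segment:

```python
def levenshtein(a, b):
    """Plain edit distance over any two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def fuzzy_match(seg, tm_seg):
    """Word-level fuzzy match score in percent: 1 - dist / max length."""
    a, b = seg.split(), tm_seg.split()
    return 100 * (1 - levenshtein(a, b) / max(len(a), len(b)))

# One clause of the input matches the TM segment exactly; the rest differs.
inp = "w1 w2 w3 w4 w5"
tm = "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16 w17 w18 w19 w20"

score = fuzzy_match(inp, tm)
print(f"{score:.0f}%")  # 25%: below even a 30-40% minimum threshold
```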
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is shown in Table 2.
Without clause splitting
  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example.

Without clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example.
5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         23.33%                 90.00%
Retrieved (minimum threshold)         38.00%                 92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         14.00%                 88.89%
Retrieved (minimum threshold)         14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
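The paper ran the test in SPSS; the same statistic can be computed directly, as in the sketch below. The scores here are invented for illustration, not the study's data. As described above, unretrieved matches score 0, and an input split into several clauses contributes the average of its clause scores so the two samples stay paired:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t statistic: mean of the per-case differences over its
    standard error (what a paired-samples t-test computes)."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Invented per-segment match scores for illustration only.
without_split = [0, 0, 74, 0, 81, 0, 0, 69, 0, 77]
with_split = [mean(s) for s in
              [[100], [92, 88], [100], [0, 85], [100], [90],
               [78, 96], [100], [83], [100]]]  # averaged multi-clause cases

t = paired_t(without_split, with_split)
print(f"t = {t:.2f}")  # strongly negative: with-splitting scores are higher
```

Significance is then read off the t distribution with n - 1 degrees of freedom, as SPSS does.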
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold          19.23                     85.30888891            0.000
Minimum threshold          29.28                     86.48888891            0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold          11.78                     83.87777781            0.000
Minimum threshold          11.78                     88.07111113            0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           2.23                     17.21111111            0.000
Minimum threshold           6.49                     30.62111112            0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           2.71                     23.083                 0.000
Minimum threshold          16.83                     46.36777779            0.000

Table 5: Paired t-test on all experiments.
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          2.67%                 17.84%
Retrieved (minimum threshold)         10.00%                 41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          3.33%                 25.95%
Retrieved (minimum threshold)         36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that, on the target side, what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating and the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
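A minimal sketch of the exact-plus-fuzzy retrieval idea described above, using difflib as a stand-in for the Levenshtein-style scoring of real CAT tools; the TM entries here are invented for illustration:

```python
from difflib import get_close_matches

# Toy TM: source segment -> target segment (invented entries).
tm = {
    "the file could not be opened": "il file non può essere aperto",
    "press cancel to cancel your personal information":
        "premere cancel per cancellare i propri dati personali",
}

def lookup(segment: str, cutoff: float = 0.7):
    """Return an exact match if the segment was translated before,
    otherwise the closest fuzzy match above the cutoff."""
    if segment in tm:                                   # exact match
        return tm[segment], "exact"
    close = get_close_matches(segment, list(tm), n=1, cutoff=cutoff)
    if close:                                           # fuzzy match
        return tm[close[0]], "fuzzy"
    return None, "no match"

print(lookup("the file could not be opened"))
print(lookup("press cancel to cancel your booking information"))
```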
The fuzzy match level reflects the corrections that a professional translator must make to the retrieved suggestion for it to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human evaluations.
The rest of the paper is organized as follows: section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
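Exact percentages depend on the tokenization and scoring formula each tool uses; the sketch below uses difflib's word-level similarity as a transparent stand-in to show the same effect discussed here: the new input (4) scores far higher against the paraphrase (2) than against the original SVC sentence (1):

```python
from difflib import SequenceMatcher

def word_match(a: str, b: str) -> float:
    """Word-level similarity in percent (SequenceMatcher as a transparent
    stand-in for the undisclosed scoring of commercial CAT tools)."""
    return 100 * SequenceMatcher(None, a.lower().split(),
                                 b.lower().split()).ratio()

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"  # paraphrase of s1
s4 = "Press Cancel to cancel your booking information"   # new input

print(f"against the original SVC sentence: {word_match(s4, s1):.0f}%")
print(f"against its paraphrase:            {word_match(s4, s2):.0f}%")
```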
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except in some cases, they appear with an indirect object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrase of the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006); the same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars for all the other grammatical tenses and moods, which follow the same structure. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to the verb it derives from, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
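The matching pattern described above (support verb, optional determiner/adjective/adverb, nominalization, optional preposition) can be illustrated with a toy sketch. A real implementation uses NooJ dictionaries and local grammars; the tiny lexicon and regular expression below are hypothetical stand-ins covering only the English simple present.

```python
import re

# Toy sketch of SVC paraphrasing, not the actual NooJ grammar: the lexicon
# maps (support verb lemma, predicate noun) pairs to the derived full verb.
SVC_LEXICON = {
    ("make", "decision"): "decide",
    ("take", "decision"): "decide",
    ("make", "installation"): "install",
}

# support verb + optional determiner + predicate noun + optional "of"
SVC_RE = re.compile(
    r"\b(make|makes|take|takes)\s+"
    r"(?:(?:a|an|the)\s+)?"
    r"(decision|installation)\b"
    r"(?:\s+of)?"
)

def lemma(verb):
    return {"makes": "make", "takes": "take"}.get(verb, verb)

def paraphrase_svcs(sentence):
    def repl(m):
        verb, noun = m.group(1), m.group(2)
        full = SVC_LEXICON.get((lemma(verb), noun))
        if full is None:
            return m.group(0)            # unknown SVC: leave untouched
        # keep third-person agreement of the original support verb
        return full + ("s" if verb.endswith("s") else "")
    return SVC_RE.sub(repl, sentence)
```

Applied to a segment such as "She makes the installation of the software", the sketch yields "She installs the software", mirroring the behaviour of the local grammar.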
The last pipeline performs the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
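The overall TM expansion step can be sketched as follows; `paraphrase` stands in for the whole NooJ pipeline (the function names are ours), and the target side of every unit is left untouched, exactly as described above.

```python
# Sketch of the TM expansion: paraphrased source units are concatenated to
# the original TM, target units are never modified. The paraphrase function
# here is a trivial placeholder for the NooJ step.

def expand_tm(tm_units, paraphrase):
    """tm_units: list of (source, target) pairs. Returns the original TM
    plus a paraphrased copy of every unit whose source actually changed."""
    expanded = list(tm_units)
    for src, tgt in tm_units:            # targets are protected
        new_src = paraphrase(src)
        if new_src != src:               # keep only genuine paraphrases
            expanded.append((new_src, tgt))
    return expanded

tm = [("You must make a decision", "Devi decidere")]
out = expand_tm(tm, lambda s: s.replace("make a decision", "decide"))
```

The resulting list corresponds to the parTM of the experiments: original units first, paraphrased copies appended.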
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.
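Fuzzy match scores in CAT tools are typically derived from edit distance; the exact formula used by commercial tools such as Trados is proprietary, so the following is only a common word-level Levenshtein similarity, given as an illustration of what the reported percentages measure.

```python
# Illustrative word-level fuzzy match score (not the proprietary Trados
# formula): percentage similarity based on Levenshtein edit distance.

def levenshtein(a, b):
    """Edit distance between token sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(src, tm_seg):
    """Similarity in percent between a new sentence and a TM segment."""
    a, b = src.split(), tm_seg.split()
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```

Under this measure, paraphrasing "make sure" to match a TM unit starting with "ensure" is exactly what lifts a segment from a low fuzzy band into a higher one.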
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1
The results of both analyses are given in Table 1. Our paraphrased TM markedly increases the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%), but we also achieve notably high gains in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95–99%, 6% in category 85–94%, 28% in category 75–84% and, finally, 27% in category 0–74% (No match + 50–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
While there are some drops in terms of fuzzy match improvement, our system presents only a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly, because they contain out-of-vocabulary words or because of their syntactic complexity. This weakness of our approach will be addressed in future work.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve as much as possible the fuzzy match level for translation units that have the same meaning but do not share the same lexical items.
The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.
In the future we will continue to explore paraphrasing of other support verbs and extend the approach to other source languages. Last but not least, applying the paraphrase framework to the target sentences may further improve the quality of the translations.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithmsfor searching matching and suggesting segments
CZ          EN           probabilities                  alignment
být větší   be greater   0.538 0.053 0.538 0.136        0-0 1-1
být větší   be larger    0.170 0.054 0.019 0.148        0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
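The record format shown in Figure 1 mirrors the Moses phrase-table layout, where fields are separated by "|||". A small sketch of reading such an entry (assuming the four-score variant without the optional count field):

```python
# Sketch: reading a Moses-style phrase-table entry as in Figure 1. The four
# scores are inverse phrase probability, inverse lexical weight, direct
# phrase probability and direct lexical weight; alignment points are pairs
# of source-target word indexes.

def parse_phrase_entry(line):
    src, tgt, scores, align = [f.strip() for f in line.split("|||")]
    return {
        "src": src.split(),
        "tgt": tgt.split(),
        "scores": [float(s) for s in scores.split()],
        "align": [tuple(map(int, p.split("-"))) for p in align.split()],
    }

entry = parse_phrase_entry(
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
```

Selecting the best translation for a subsegment then amounts to comparing these score vectors across alternative entries with the same source side.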
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
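The core join operation can be sketched as follows. This is our own simplified illustration of the combination logic, not the authors' code: subsegments are (start, end) token spans within a document segment, joined when adjacent (JOINN) or when they overlap with consistent translations on the shared tokens (JOINO); a real implementation would additionally rank candidates by the language-model fluency score.

```python
# Simplified sketch of JOIN: each subsegment is ((start, end), translation
# tokens). JOINN joins neighbouring spans, JOINO joins overlapping spans
# whose translations agree on the overlap (a simplifying assumption).

def try_join(a, b):
    """Return the joined subsegment, or None if a and b cannot be joined."""
    (s1, e1), t1 = a
    (s2, e2), t2 = b
    if e1 == s2:                        # JOINN: b starts where a ends
        return ((s1, e2), t1 + t2)
    if s1 < s2 <= e1 < e2:              # JOINO: spans overlap
        overlap = e1 - s2
        if t1[len(t1) - overlap:] == t2[:overlap]:
            return ((s1, e2), t1 + t2[overlap:])
    return None

def join_all(subsegments):
    """Greedily join subsegments, preferring longer ones first."""
    items = sorted(subsegments, key=lambda x: x[0][1] - x[0][0], reverse=True)
    result = list(items)
    for a in items:
        for b in items:
            joined = try_join(a, b)
            if joined and joined not in result:
                result.append(joined)
    return result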
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods (see Section 4). The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables): each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014), which is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.
source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
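The "mixed" output described above can be sketched as follows; the span positions and translations are illustrative, and the function name is ours. Subsegments with a known translation are replaced by it, while uncovered spans stay in the source language "as is", which is exactly what METEOR then scores.

```python
# Sketch of building a "mixed" source/target output for METEOR scoring:
# covered spans are replaced by their translations, the rest is kept in
# the source language. Spans and translations below are illustrative.

def mixed_translation(tokens, covered):
    """covered: non-overlapping ((start, end), translation_tokens) pairs,
    sorted by start. Returns the mixed token sequence."""
    out, pos = [], 0
    for (start, end), translation in covered:
        out.extend(tokens[pos:start])   # uncovered part: keep source words
        out.extend(translation)         # covered part: use the translation
        pos = end
    out.extend(tokens[pos:])
    return out

seg = "oblast dat může mít libovolný tvar".split()
covered = [((2, 4), ["may", "have"])]
mixed = mixed_translation(seg, covered)
```

The resulting segment mixes source and target words, which METEOR penalizes in proportion to the untranslated material.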
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog, which is language pair independent and allows users to upload their own memories into
1 www.matecat.com
2 The tool will be released as free, open-source software. For more information, see http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is more suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work will investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog on an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC (Basic Travel Expression Corpus) corpus and their Bengali translations. Unseen input (test) segments are provided to the post-editing tool, and the tool matches each input segment against the most similar segments contained in the TM. The TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, the matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see instantly how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate): how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25) to find the edit rate between a test sentence and the TM reference sentences.
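As a rough illustration, the edit rate can be sketched as a word-level edit distance. This is a simplified stand-in for TER that ignores the shift operation (which tercom also handles); the function and variable names are ours, not part of the tool:

```python
def edit_rate(hyp, ref):
    """Minimum insertions, deletions and substitutions needed to turn
    the word sequence `hyp` into `ref`, divided by the length of `ref`.
    Simplified: TER's shift operation is not modelled here."""
    h, r = hyp.split(), ref.split()
    # dp[i][j]: edits to convert the first i words of h into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dp[i][0] = i
    for j in range(1, len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (0 if h[i - 1] == r[j - 1] else 1)
            dp[i][j] = min(dp[i - 1][j] + 1,   # delete a word of h
                           dp[i][j - 1] + 1,   # insert a word of r
                           sub)                # match or substitute
    return dp[len(h)][len(r)] / len(r)

rate = edit_rate("we want to have a table near the window",
                 "we would like a table by the window")
# 4 edits (1 deletion, 3 substitutions) over 8 reference words -> 0.5
```

For the table/window example discussed in this section, this yields exactly the four edits that the TER alignment below identifies.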
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .
TM Match: we want to have a table near the window .
TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)
we want  to   have  a table near the window .
|   D    S     S    |   |    S    |    |    |
we  -  would like   a table  by  the window .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to obtain better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
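This preference can be sketched as a weighted word-level edit cost. The weights below (0.1 for deletion, 1.0 for the rest) are illustrative placeholders: the paper states only that deletion is given a very low weight, without publishing the exact values.

```python
def weighted_edit_cost(hyp, ref, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Cost of turning TM segment `hyp` into test segment `ref`,
    with per-operation weights (shift not modelled).  The weight
    values are illustrative assumptions, not the tool's settings."""
    h, r = hyp.split(), ref.split()
    dp = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dp[i][0] = dp[i - 1][0] + w_del
    for j in range(1, len(r) + 1):
        dp[0][j] = dp[0][j - 1] + w_ins
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (0.0 if h[i - 1] == r[j - 1] else w_sub)
            dp[i][j] = min(dp[i - 1][j] + w_del,  # delete a TM word (cheap)
                           dp[i][j - 1] + w_ins,  # insert a missing word
                           sub)                   # match or substitute
    return dp[len(h)][len(r)]

test_seg = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
# Cheap deletions: tm1 costs 6 * 0.1 = 0.6, tm2 costs 3 * 1.0 = 3.0,
# so tm1 is ranked first; with uniform weights the ranking flips.
```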
3.2 Color Coding

From the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easier, we color-code the matched and unmatched parts in each reference translation: green portions mark matched fragments, red portions mark mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between the TM source sentences and the target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
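Combining the two alignment sets can be sketched as follows. The data structures are assumptions for illustration (TER edit operations as (op, source-index) pairs, the word alignment as a source-to-target index map); the tool's actual internal formats are not described in the paper:

```python
def color_segments(edit_ops, src_len, tgt_len, src2tgt):
    """Flag TM source and target words green (keep) or red (edit).

    edit_ops : TER edit operations as (op, tm_source_index) pairs,
               where op is 'C' (match), 'S' or 'D'; insertions
               ('I', None) touch no TM source word and are skipped.
    src2tgt  : word alignment {tm_source_index: [target_index, ...]},
               in essence what GIZA++ provides (sketch, not its format).
    """
    # A TM source word is green exactly when TER marks it as a match.
    src_green = [False] * src_len
    for op, i in edit_ops:
        if op == 'C':
            src_green[i] = True
    # A target word is green when it is aligned to a green source word.
    tgt_green = [False] * tgt_len
    for i, green in enumerate(src_green):
        if green:
            for j in src2tgt.get(i, []):
                tgt_green[j] = True
    src_colors = ['green' if g else 'red' for g in src_green]
    tgt_colors = ['green' if g else 'red' for g in tgt_green]
    return src_colors, tgt_colors
```

For the table/window example above, "we a table the window" comes out green on the source side and, via the alignment, so do their Bengali counterparts.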
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the green-to-red ratio. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green and yet not be good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context, it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a screenshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows these 5 color-coded topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, computing the TER alignments against all the reference sentences (i.e. source TM segments) will take a long time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on inverted indexes.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and are of little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to turn the use of these posting lists on or off; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, although the time taken to produce the output will be significantly different.
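A minimal sketch of this filtering step, with an illustrative stop-word list and names of our own choosing:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "in", "is", "it"}  # illustrative only

def build_index(tm_sources):
    """Inverted index: lowercased surface form -> ids of the TM segments
    containing it.  No stemming: only exact surface matches count."""
    index = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """TM segments sharing at least one vocabulary word with the input;
    only these are passed on to the expensive TER comparison."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```

Only the segments returned by `candidates` are scored with TER, so a large TM is never scanned in full for a single input sentence.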
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved to a log file which the post-editors can work with directly later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on, and development of, a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments; in the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
We are happy we could include in the workshop programme 4 long contributions and 3 short papers dealing with the aforementioned issues.
Vít Baisa, Aleš Horák and Marek Medveď discuss in Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods how existing translation memories can be extended using linguistically motivated segment combination approaches that concentrate on preserving high translation quality.

Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories, Eduard Barbu presents a method for identifying false translations in translation memories, conceived as a classification task.

In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodorou explores the use of paraphrasing for retrieving better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.

CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.

Carla Parra Escartín aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators in Creation of new TM segments: Fulfilling translators' wishes. She presents a pilot study where the requests of translators are implemented in the translation workflow.

Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper Improving Translation Memory Matching through Clause Splitting shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible
without the support received from the EXPERT project (FP7/2007-2013, under REA grant agreement no. 317471, http://expert-itn.eu).
Constantin Orasan and Rohit Gupta
Organizers
Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK
Program Committee
Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk
Invited Speaker
Marcello Federico, Fondazione Bruno Kessler, Italy
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36
Conference Program
9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3.º I, 28230 Las Rozas, Madrid, Spain
carlaparra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of the terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive text (AutoSuggest in SDL Trados Studio and Muses in memoQ), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.
1 www.sdl.com
2 www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and to estimate the time and resources needed to carry out the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite these reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which functionality, according to them, was missing as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all of them had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically, and it is difficult to estimate when such reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

3In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html#repair_resource3.html
Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an underuse of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out. If a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of this functionality: fragment assembly.6
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6For further information, see http://kilgray.com/memoq/2015/help-en/index.html#fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source-sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality.
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM.

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.
Now imagine we need to translate the sentence in 3.

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4.

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving these translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be one in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear.
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment, only including the text before the ellipsis, appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.
2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part being the fragment of the segment after the ellipsis/colon.
3. Create a new TM segment for each fragment.
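The splitting procedure above can be sketched as a small Python function. The helper name and the decision to keep the separator with the first fragment are our own illustrative choices, not details of the production tool:

```python
import re

# Matches an ellipsis ("..." or the single character) or a colon.
SEPARATOR = re.compile(r'\.\.\.|\u2026|:')

def split_on_separator(source, target):
    """Return new (source, target) fragment pairs, or [] when either
    side lacks the separator (step 1 of the procedure above)."""
    src_match = SEPARATOR.search(source)
    tgt_match = SEPARATOR.search(target)
    if not (src_match and tgt_match):
        return []
    pairs = []
    # Step 2: split both sides at the first separator.
    src_parts = (source[:src_match.end()].strip(),
                 source[src_match.end():].strip())
    tgt_parts = (target[:tgt_match.end()].strip(),
                 target[tgt_match.end():].strip())
    # Step 3: each aligned pair of non-empty fragments becomes a new segment.
    for src_frag, tgt_frag in zip(src_parts, tgt_parts):
        if src_frag and tgt_frag:
            pairs.append((src_frag, tgt_frag))
    return pairs
```

For Example 5 this yields one pair for the fragment up to the ellipsis and one for the fragment after it.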
4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters, as well as the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.
2. Keep three fragments out of each sentence:
   (a) the sentence with those characters and the content between them removed;
   (b) a fragment starting at the opening character and finishing at the closing one, including the content within; in this fragment the parentheses, square brackets or curly braces are maintained;
   (c) a fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.
3. Create a new TM segment for each fragment.
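A minimal sketch of this three-fragment strategy, assuming (as noted below) that bracketed spans appear in the same order in both languages; the function name and regular expressions are illustrative:

```python
import re

# One pattern per bracket type; each matches a span including its brackets.
BRACKET_PATTERNS = [r'\([^()]*\)', r'\[[^\[\]]*\]', r'\{[^{}]*\}']

def bracket_fragments(source, target):
    """Return the new (source, target) pairs produced by steps 2(a)-2(c)."""
    pairs = []
    for pattern in BRACKET_PATTERNS:
        src_spans = re.findall(pattern, source)
        tgt_spans = re.findall(pattern, target)
        # Step 1: the bracketed content must be present on both sides.
        if not src_spans or len(src_spans) != len(tgt_spans):
            continue
        # (a) the sentence with brackets and their content removed
        pairs.append((re.sub(pattern, '', source).strip(),
                      re.sub(pattern, '', target).strip()))
        for src_span, tgt_span in zip(src_spans, tgt_spans):
            pairs.append((src_span, tgt_span))              # (b) brackets kept
            pairs.append((src_span[1:-1], tgt_span[1:-1]))  # (c) content only
    return pairs
```

For Example 8, this produces the sentence without the parenthesised clause, the clause with its parentheses, and the clause alone.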
At this preliminary stage, we assumed that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. We did so because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segment.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
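The rule for quotations and paired tags can be sketched in the same style; the `<1>...<1>` tag pattern follows Example 12, and the helper names are again our own:

```python
import re

TAG = re.compile(r'<(\d+)>(.*?)<\1>')   # paired tags of the form <1>...<1>
QUOTE = re.compile(r'"([^"]*)"')        # straight double quotes

def markup_fragments(source, target):
    """Return new segments: the bi-segment stripped of tags/quotes, plus
    the text inside each aligned tag or quotation pair."""
    pairs = []
    strip = lambda text: QUOTE.sub(r'\1', TAG.sub(r'\2', text))
    stripped = (strip(source), strip(target))
    if stripped != (source, target):
        pairs.append(stripped)  # the sentence kept without the markup
    for pattern, group in ((TAG, 2), (QUOTE, 1)):
        src_inner = [m.group(group) for m in pattern.finditer(source)]
        tgt_inner = [m.group(group) for m in pattern.finditer(target)]
        if src_inner and len(src_inner) == len(tgt_inner):
            pairs.extend(zip(src_inner, tgt_inner))  # inner text as segments
    return pairs
```

For Example 12, this keeps the whole sentence without the `<1>`, `<2>` and `<3>` tags, plus one new segment per tagged span.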
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of
our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95-99%             4          2
85-94%             0          0
75-84%           285         14
50-74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ, using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
             Segments   Words (EN)   Words (ES)   Words (total)
Project TM      16842       212472       244159          456631
Product TM      20923       274542       317797          592339
Client TM      256099      3427861      3951732         7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach. Table 4 breaks down the number of segments generated using each strategy and for each TM
                          Segments   Words (EN)   Words (ES)   Words (total)
Project TM new segments       6776        56973        66297          123270
Product TM new segments       7760        71034        83125          154159
Client TM new segments       74041       662714       769705         1432419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
Type             Project TM   Product TM   Client TM
Ellipsis                  7            6          50
Colon                  1801         2094       17894
Parentheses            1361         1621       29637
Square brackets          78           73         598
Curly braces              0            0           0
Quotations             1085         1146        8797
Tags                   2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.
2. TM2: Only the new segments generated from the project TM provided by the client.
3. TM3: A combination of the project TM and the new segments retrieved from it.
4. TM4: Only the new segments generated from the product TM.
5. TM5: The project TM and the new segments generated from the product TM.
6. TM6: The project TM combined with the new-segment TMs generated from the project TM and the ones from the product TM.
7. TM7: Only the new segments generated from the client TM.
8. TM8: The project TM combined with the new segments generated from the client TM.
9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.
10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.
11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When memoQ's fragment assembly functionality is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-
             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started  234   204   150   202   143   143    38    27    26     26     26
Pre-trans    155   111   178   108   179   183   139   199   205    201    205
Fragments     36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.
           TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%         0     5     5     5     5     5     5     5     5      5      5
95-99%       2     3     3     4     4     4     9     9     9      9      9
85-94%       0     1     1     1     1     1     2     2     2      2      2
75-84%      14     9    14     9    15    15    12    16    16     16     16
50-74%     187   130   198   136   202   202   180   223   226    226    226
No match   142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable even though the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy-match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for the automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained using the first method is high, and the errors are relatively few.
1https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
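As an illustration of this task framing (not of the classifiers actually evaluated in the paper), even a toy baseline can make the yes/no decision from surface cues such as length divergence; the threshold below is purely illustrative:

```python
def is_true_translation(source, target, max_ratio=2.0):
    """Toy yes/no decision: reject a bi-segment whose source and target
    character lengths diverge too much. Threshold is illustrative only."""
    ls, ld = len(source), len(target)
    if min(ls, ld) == 0:
        return False
    return max(ls, ld) / min(ls, ld) <= max_ratio

# A plausible translation passes; a chat-style "answer" is rejected.
print(is_true_translation("How are you?", "Come stai?"))  # True
print(is_true_translation("How are you?", "Bene"))        # False
```

Real classifiers replace this single rule with a feature vector and a learned decision boundary, as described in Section 4.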
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval, and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench.2 Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (1993): in a parallel corpus, the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.
CG = (l_s − l_d) / √(3.4 (l_s + l_d))        (1)
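Equation 1 can be computed directly from the segment strings (a minimal sketch; the function name and the use of character lengths are our assumptions):

```python
import math

def church_gale(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011):
    the difference between the segment lengths, normalized so that
    true translations cluster around 0."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A genuine pair scores near 0; a grossly mismatched pair does not.
good = church_gale("How are you?", "Come stai?")
bad = church_gale("How are you?", "Bene" * 40)
```

Scores far from zero flag bi-segments whose lengths are implausible for a true translation pair.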
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: contains 1243 bi-segments, of which 373 are negative examples.
• Test Set: contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
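The tail-biased sampling step described above can be sketched as follows (the weighting scheme and helper name are our illustration, not the authors' exact procedure):

```python
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    """Draw k bi-segments, weighting each by |Church-Gale score| so that
    pairs far from the center of the distribution are sampled more often."""
    rng = random.Random(seed)
    # Small floor keeps central (likely-correct) pairs possible too.
    weights = [abs(s) + 0.01 for s in cg_scores]
    return rng.choices(bisegments, weights=weights, k=k)

pairs = [("ok", "va bene"), ("hello", "x" * 80), ("thanks", "grazie")]
scores = [-0.9, -7.4, -0.2]  # toy Church-Gale scores; the middle pair is suspicious
candidates = biased_sample(pairs, scores, k=5)
```

The sampled candidates are then validated manually before entering the training set.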
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect that the classifiers learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).
• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment in the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference: The value of the feature is the ratio between the difference of the number of words containing at least a capital letter in the source segment and the target segment, and the sum of the capital letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only capital letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
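For illustration, two of the features above could be computed as follows (a sketch; the helper names are ours, and using Python's `string.punctuation` as the punctuation set is an assumption):

```python
import math
import string
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def punctuation_similarity(src: str, tgt: str) -> float:
    """Cosine similarity of the punctuation-count vectors of the two segments."""
    counts = lambda s: Counter(c for c in s if c in string.punctuation)
    return cosine(counts(src), counts(tgt))

def lang_dif(declared: tuple, detected: tuple) -> int:
    """How many of the two language codes disagree with the detector (0, 1 or 2)."""
    return sum(d != e for d, e in zip(declared, detected))
```

The same cosine helper would serve the tag, email, URL and number similarity features, each applied to its own count vector.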
The translation of the source English segment to Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules inferred are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.
• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
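A minimal sketch of running these six classifiers with scikit-learn, the library the paper uses; the toy feature vectors and the query are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Toy feature rows: [cg_score, bisegment_similarity, lang_dif], normalized to [0, 1]
X = [[0.05, 0.90, 0.0], [0.90, 0.10, 0.5], [0.10, 0.80, 0.0],
     [0.80, 0.20, 1.0], [0.00, 0.95, 0.0], [0.70, 0.15, 0.5]]
y = [1, 0, 1, 0, 1, 0]  # 1 = true translation, 0 = false bi-segment

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
}
query = [[0.02, 0.92, 0.0]]  # looks like a genuine bi-segment
predictions = {name: clf.fit(X, y).predict(query)[0]
               for name, clf in classifiers.items()}
```

On such a clearly separable toy set, all six classifiers label the query as a true translation.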
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71
Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
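The two baselines correspond to scikit-learn's DummyClassifier strategies "uniform" and "stratified". A sketch of the three-fold stratified evaluation on toy data (the data, and the choice of K-Nearest Neighbors as the illustrative classifier, are ours):

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the training set: ~70% positive / 30% negative, as in the paper
X = [[i / 100.0, (100 - i) / 100.0] for i in range(100)]
y = [1 if i < 70 else 0 for i in range(100)]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for name, clf in [
    ("K-Nearest Neighbors", KNeighborsClassifier(n_neighbors=3)),
    ("Baseline Uniform", DummyClassifier(strategy="uniform", random_state=0)),
    ("Baseline Stratified", DummyClassifier(strategy="stratified", random_state=0)),
]:
    # Mean F1 over the three stratified folds
    results[name] = cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
```

Stratification keeps the 70/30 class ratio in every fold, so the dummy baselines are evaluated under the same conditions as the real classifiers.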
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, less than half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59
Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
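The arithmetic behind this conclusion can be checked directly from the reported confusion counts:

```python
# Confusion counts reported for the SVM on the 309-segment test set
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn                      # 309 examples

precision = tp / (tp + fp)                     # correctly kept / all kept
recall = tp / (tp + fn)                        # correctly kept / all genuine pairs
discarded_good = fn / total                    # share of good bi-segments thrown away
```

With these counts, roughly one bi-segment in ten is a genuine translation that would nevertheless be deleted, which is what makes precision, not F1, the critical metric here.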
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
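A hypothetical sketch of this precision-oriented refinement: rank the Logistic Regression probabilities and act only on the confident extremes (the data and the 0.9/0.1 thresholds are invented):

```python
from sklearn.linear_model import LogisticRegression

# Toy one-feature data (e.g. bisegment similarity); values invented
X = [[0.95], [0.90], [0.85], [0.80], [0.20], [0.15], [0.10], [0.05]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(true translation) per bi-segment

# Act only on confident predictions; leave the uncertain middle band untouched
keep = [i for i, p in enumerate(probs) if p >= 0.9]
delete = [i for i, p in enumerate(probs) if p <= 0.1]
```

Tightening the thresholds shrinks the set of bi-segments the system decides on, trading recall for the higher precision that database cleaning requires.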
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
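Such a dictionary-based substitute for the bisegment similarity feature might look like this (a sketch; the toy dictionary and helper names are ours):

```python
import math
from collections import Counter

# Toy English-Italian dictionary (hypothetical entries, e.g. built with a word aligner)
EN_IT = {"how": ["come"], "are": ["stai", "siete"], "you": ["tu", "voi"],
         "fine": ["bene"], "thanks": ["grazie"]}

def dict_translate(source: str) -> list:
    """Gloss each source word with its dictionary translations (bag of words)."""
    return [t for w in source.lower().split() for t in EN_IT.get(w, [])]

def bisegment_similarity(source: str, target: str) -> float:
    """Cosine similarity between the glossed source and the target segment."""
    a, b = Counter(dict_translate(source)), Counter(target.lower().split())
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A dictionary lookup is a local, indexable operation, which is what makes it cheap and scalable compared with calling an external MT API for every bi-segment.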
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, Sept 2015
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence, and at the same time clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
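The example can be made concrete with a word-level Levenshtein match score (a generic sketch of fuzzy matching, not any specific tool's exact formula):

```python
def levenshtein(a, b):
    """Word-level edit distance, single-row dynamic programming."""
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       prev + (a[i - 1] != b[j - 1]))  # substitution
    return row[n]

def match_score(seg1, seg2):
    """Fuzzy match score in [0, 1]: 1 - distance / length of the longer segment."""
    w1, w2 = seg1.split(), seg2.split()
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))

sent_a = " ".join(f"w{i}" for i in range(1, 21))                     # input (a): w1 ... w20
sent_b = " ".join(f"w{i}" for i in [1, 2, 3, 4, 5, *range(21, 36)])  # TM (b): w1 ... w5 w21 ... w35
clause = " ".join(f"w{i}" for i in range(1, 6))                      # shared clause: w1 ... w5
```

Against the full TM sentence, both the whole input and the bare clause score only 0.25, far below a typical 70% threshold; but if the TM also stores the clause as its own segment, the clause query scores a perfect 1.0.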
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is
a group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools will perform if clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
5 Results
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (Default threshold)    23.33%                     90.00%
Retrieved (Minimum threshold)    38.00%                     92.22%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (Default threshold)    14.00%                     88.89%
Retrieved (Minimum threshold)    14.00%                     96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than the percentages in set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause-splitting errors, such as a segment not being split at all although it has more than one clause, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat it as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
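The statistic used here can be sketched as follows. This is a pure-Python paired t-test for illustration; the paper computed significance with SPSS, and the scoring conventions (zeros for unretrieved segments, clause averaging) follow the description above.

```python
import math

def paired_t(with_split, without_split):
    """Paired t statistic over matched per-segment match scores.
    Unretrieved segments contribute a score of 0; a multi-clause input
    is assumed to have been collapsed to the average of its clause
    scores beforehand, so each position is one comparable case."""
    assert len(with_split) == len(without_split)
    diffs = [a - b for a, b in zip(with_split, without_split)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

The resulting t value is then compared against the Student t distribution with n-1 degrees of freedom to obtain the p-value.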
SET A

WORDFAST             Mean without        Mean with           p-value
                     clause splitting    clause splitting
Default threshold    19.23               85.30888891         0.000
Minimum threshold    29.28               86.48888891         0.000

TRADOS
Default threshold    11.78               83.87777781         0.000
Minimum threshold    11.78               88.07111113         0.000

SET B

WORDFAST
Default threshold    2.23                17.21111111         0.000
Minimum threshold    6.49                30.62111112         0.000

TRADOS
Default threshold    2.71                23.083              0.000
Minimum threshold    16.83               46.36777779         0.000

Table 5: Paired t-test on all experiments.
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    2.67%                      17.84%
Retrieved (minimum threshold)    10.00%                     41.08%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    3.33%                      25.95%
Retrieved (minimum threshold)    36.00%                     70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, Sept 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Such programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition to this, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based) trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based fuzzy match between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based fuzzy match is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
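A word-level version of this computation can be sketched as follows. The exact normalization differs between commercial tools (and between word and character counting conventions), so the percentages need not match the figures quoted above exactly.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (deletions,
    insertions and substitutions, each at cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(a, b):
    """Koehn-style fuzzy match: 1 - edit distance / length of longer side."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Applied to the word lists of sentences (1) and (2), the edit distance is 4 (substitute "cancellation" by "cancel" and delete "make", "the" and "of").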
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and would then be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC built on the nominalization, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a bare direct object (e.g. to make attention) or with a determined direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Fuzzy matching aside, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
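A reader for entries of this shape can be sketched as below. This is a hypothetical simplification, assuming the standard NooJ convention of a comma between lemma and category; real NooJ .dic files carry additional fields such as derivational paradigms and translation equivalents.

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line into lemma, category, properties.
    Flag properties (e.g. +Hum) map to True; valued properties
    (e.g. +FLX=TABLE) keep their value."""
    lemma, _, rest = line.partition(",")
    category, *features = rest.split("+")
    props = {}
    for feat in features:
        key, _, value = feat.partition("=")
        props[key] = value or True
    return lemma, category, props
```

For example, `parse_entry("artist,N+FLX=TABLE+Hum")` yields the lemma "artist", the category N, the inflectional paradigm TABLE, and the semantic flag Hum.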
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework.
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
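The effect of such a grammar can be approximated in a few lines. This is a toy regex-based stand-in with a hard-coded noun-to-verb lexicon: it handles only base verb forms, whereas NooJ's grammars regenerate the correct tense and person from its dictionaries.

```python
import re

# Toy lexicon: predicate nouns mapped to their derived full verbs.
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = ("make", "take", "give", "have")

PATTERN = re.compile(
    r"\b(?:" + "|".join(SUPPORT_VERBS) + r")\s+"   # support verb
    r"(?:(?:a|an|the)\s+)?"                        # optional determiner
    r"(" + "|".join(NOMINALIZATIONS) + r")"        # predicate noun
    r"(?:\s+of)?\b")                               # optional preposition

def paraphrase_svc(sentence):
    """Replace a support-verb construction with its full-verb paraphrase."""
    return PATTERN.sub(lambda m: NOMINALIZATIONS[m.group(1)], sentence)
```

On sentence (1) this rewrites "make the cancellation of" to "cancel", producing sentence (2).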
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
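Putting the steps together, the TM expansion amounts to the following schematic version. Here `paraphrase` stands in for the whole NooJ pipeline, and tokenization/de-tokenization are omitted; the translation units and the substitution used in the usage note are illustrative.

```python
def expand_tm(units, paraphrase):
    """Return the original TM plus a paraphrased copy of every source
    segment whose paraphrase differs; target sides are never modified."""
    expanded = list(units)
    for source, target in units:
        new_source = paraphrase(source)
        if new_source != source:
            expanded.append((new_source, target))
    return expanded
```

Because each paraphrased unit keeps its original target, a fuzzy match against the paraphrase still retrieves the human translation unchanged.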
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No match                  62       35
Total                    200      200

Table 1: Statistics for experimental data.
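The analysis bands in Table 1 correspond to the following bucketing. This is a sketch of how a CAT tool's analysis report groups per-segment scores, with the thresholds taken from the table; actual tools may define band edges slightly differently.

```python
from collections import Counter

def fuzzy_category(score):
    """Map a fuzzy match score (in percent) to a Trados-style analysis band."""
    if score >= 100: return "100%"
    if score >= 95:  return "95%-99%"
    if score >= 85:  return "85%-94%"
    if score >= 75:  return "75%-84%"
    if score >= 50:  return "50%-74%"
    return "No match"

def analysis_report(scores):
    """Count how many segments fall into each band."""
    return Counter(fuzzy_category(s) for s in scores)
```

Running both the plTM and parTM score lists through such a report yields the two columns of Table 1.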
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine errors will appear and less post-editing is required for matches below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially in the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is abundant compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and other supported languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References
Athayde, M. F. (2001). Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. (2008). Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. (2008). Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. (2003). The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. (2014). Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. (2014). Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. (2010a). Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. (2010b). Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. (1997). Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford: Oxford University Press.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. (1980). The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. (2010). Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. (1999). Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. (2003). NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. (2009). EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. (2014). Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. (2010). Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

  CZ         EN          probabilities              alignment
  být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
  být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text-chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
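As an illustration (our own toy sketch, not the authors' code), the four Moses scores shown in Figure 1 could be used to pick the best alternative translation; ranking by the product of the four scores is our simplifying assumption:

```python
from math import prod

# Moses-style phrase-table entries for one source subsegment, carrying the
# four scores named in the text (inverse phrase prob., inverse lexical
# weight, direct phrase prob., direct lexical weight). Values from Figure 1.
entries = [
    ("byt vetsi", "be greater", [0.538, 0.053, 0.538, 0.136]),
    ("byt vetsi", "be larger",  [0.170, 0.054, 0.019, 0.148]),
]

def best_translation(source, table):
    """Rank alternative translations of `source` by the product of their
    four scores and return the highest-scoring one (our own heuristic)."""
    candidates = [(tgt, prod(s)) for src, tgt, s in table if src == source]
    return max(candidates, key=lambda c: c[1])[0]

print(best_translation("byt vetsi", entries))  # prefers "be greater"
```

Any monotone combination of the scores would do; the product is merely one simple choice.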
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "být" from the source language is translated to the first word in the translation, "be", and the second word "větší" to the second, "greater". These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: joint subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joint subsegments neighbour in a segment from the document; output = NJ.
1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
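To make the SUBSTITUTEO idea concrete, here is a minimal illustrative sketch (our own simplification, not the paper's implementation); it assumes the gap is a fixed number of tokens and splices the subsegment in at the overlapping tokens:

```python
def substitute_overlap(segment, subsegment, gap_len=1):
    """SUBSTITUTE_O sketch: find the longest tail of `subsegment` that also
    occurs in `segment`, and replace the `gap_len` tokens immediately
    preceding that occurrence with the subsegment's remaining prefix."""
    seg, sub = segment.split(), subsegment.split()
    for k in range(len(sub) - 1, 0, -1):          # longest shared tail first
        tail = sub[-k:]
        for i in range(len(seg) - k + 1):
            if seg[i:i + k] == tail and i >= gap_len:
                return " ".join(seg[:i - gap_len] + sub + seg[i + k:])
    return None  # no overlap found

# The example from Table 1:
print(substitute_overlap("lze rozdělit do těchto kategorií",
                         "následujících kategorií"))
```

The real method works on subsegments of the extended TM together with their translations; this sketch only shows the source-side splicing.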
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
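The greedy JOIN loop described above can be sketched roughly as follows, assuming subsegments are represented as (start, end) token spans within the input segment; the representation and helper names are ours, not the paper's:

```python
def join_subsegments(spans, neighbour=False):
    """Greedy JOIN sketch: take the longest known span first and try to
    join it with every other; newly built (longer) spans are tried before
    the rest. JOIN_O requires a true overlap, JOIN_N adjacent spans."""
    pending = sorted(set(spans), key=lambda s: s[0] - s[1])  # longest first
    seen, built = set(pending), []
    while pending:
        a = pending.pop(0)
        new = []
        for b in pending:
            lo, hi = min(a, b), max(a, b)
            ok = hi[0] == lo[1] if neighbour \
                else hi[0] < lo[1] and hi[1] > lo[1]
            if ok and (lo[0], hi[1]) not in seen:
                seen.add((lo[0], hi[1]))
                new.append((lo[0], hi[1]))
        built.extend(new)
        pending = new + pending   # prefer the freshly joined, longer spans
    return built

print(join_subsegments([(0, 3), (2, 5)]))                  # [(0, 5)]
print(join_subsegments([(0, 2), (2, 5)], neighbour=True))  # [(0, 5)]
```

In the actual method, each joined span would also carry the concatenated target-side translation, filtered by the language-model fluency score described above.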
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg.:     Oblast dat může mít libovolný tvar
reference:       The data area may have an arbitrary shape
generated seg.:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk02.05@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors either use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0]

we want  to    have a table near the window
|  D     S     S    | |     S    |   |
we -     would like a table by   the window
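The alignment labels above can be reproduced with a plain word-level edit distance; the sketch below is our own simplification of tercom (it omits TER's block shifts):

```python
def align(tm_match, test_input):
    """Word-level alignment labels: C (match), S (substitute), D (delete a
    TM word), I (insert a new word). Shift operations are not modelled."""
    h, r = tm_match.split(), test_input.split()
    # dp[i][j] = minimum number of edits turning h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1): dp[i][0] = i
    for j in range(1, len(r) + 1): dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (h[i-1] != r[j-1]),
                           dp[i-1][j] + 1, dp[i][j-1] + 1)
    ops, i, j = [], len(h), len(r)
    while i or j:          # backtrace, preferring matches/substitutions
        if i and j and dp[i][j] == dp[i-1][j-1] + (h[i-1] != r[j-1]):
            ops.append("C" if h[i-1] == r[j-1] else "S"); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i-1][j] + 1:
            ops.append("D"); i -= 1
        else:
            ops.append("I"); j -= 1
    return "".join(reversed(ops))

labels = align("we want to have a table near the window",
               "we would like a table by the window")
print(labels)  # one minimum-cost labelling: 5 C, 3 S, 1 D
```

For this pair every minimum-cost labelling has five matches, three substitutions and one deletion, matching the tercom output shown above up to the position of the edits.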
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism, based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for the different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
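The weighted ranking can be illustrated with the same dynamic program, using a much cheaper deletion; the concrete weights here are illustrative, not the ones used in CATaLog:

```python
def weighted_cost(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Edit cost of adapting `tm_seg` to `test_seg` with a cheap deletion
    (weights are illustrative; shift operations are not modelled)."""
    h, r = tm_seg.split(), test_seg.split()
    dp = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1): dp[i][0] = i * w_del
    for j in range(1, len(r) + 1): dp[0][j] = j * w_ins
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (0.0 if h[i-1] == r[j-1] else w_sub),
                           dp[i-1][j] + w_del,   # drop a highlighted TM word
                           dp[i][j-1] + w_ins)   # translate a new word
    return dp[-1][-1]

test = "how much does it cost"
seg1 = "how much does it cost to the holiday inn by taxi"
seg2 = "how much"
# Six cheap deletions beat three expensive insertions:
print(weighted_cost(seg1, test) < weighted_cost(seg2, test))  # True
```

With equal weights the comparison flips, which is exactly the effect described in the text.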
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions imply that they are matched fragments, and red portions imply a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
bull Bengali আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
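Projecting the source-side match information onto the target can be sketched as follows. The alignment dictionary format below is an assumption for the sketch, not GIZA++'s native file format:

```python
def color_target(matched_src, src2tgt, tgt_len):
    """Color target tokens using source-side match information.

    matched_src: indices of TM source tokens that the TER alignment
    marked as matching the input sentence.
    src2tgt: {source_index: [target_indices]} word alignment, e.g.
    parsed from GIZA++ output.
    A target token is green only if every source token aligned to it
    matched; unaligned or partially mismatched tokens stay red."""
    aligned_sources = {}
    for s, tgts in src2tgt.items():
        for t in tgts:
            aligned_sources.setdefault(t, []).append(s)
    return ["green" if t in aligned_sources
            and all(s in matched_src for s in aligned_sources[t])
            else "red"
            for t in range(tgt_len)]

# Toy example: three source tokens, a two-token target, and only
# source tokens 0 and 2 matching the input sentence.
colors = color_target(matched_src={0, 2},
                      src2tgt={0: [0], 1: [1], 2: [1]},
                      tgt_len=2)
```

Target token 1 is aligned to both a matched and an unmatched source token, so it stays red.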
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the 5 color coded topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for choosing whether or not to use these posting lists; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
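The posting-list filtering can be sketched as follows; the stop-word list and TM sentences below are illustrative, not the tool's actual data:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "is"}   # illustrative, not the tool's list

def build_postings(tm_sources):
    """One posting list per lowercased surface form (no stemming)."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Only TM sentences sharing at least one vocabulary word with the
    input are passed on to the (expensive) TER similarity computation."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "you are wrong",
      "we want to have a table near the window"]
postings = build_postings(tm)
cands = candidate_sentences(postings, "you gave me wrong number")
```

Only the first two TM sentences share a content word with the input, so the third never reaches the TER stage.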
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations; we believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Introduction
Translation Memories (TM) are amongst the most used tools by professional translators. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations, by being able to retrieve how a similar sentence was translated before. Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories focused on improving the user experience by allowing processing of a variety of document formats, intuitive user interfaces, and so on.
The term second generation translation memories has been around for more than ten years, and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve the matching of subsentential chunks, edit distance operations between syntactic trees, and the incorporation of semantic and discourse information in the matching process. Terminologies, glossaries and ontologies are also very useful for translation memories, facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction which can be integrated in the translation process. The building of translation memories from corpora is another field where methods from NLP can contribute to improving the translation process.
We are happy we could include in the workshop programme 4 long contributions and 3 short papers dealing with the aforementioned issues.
In Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods, Vít Baisa, Aleš Horák and Marek Medveď discuss how it is possible to extend existing translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality.
Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories Eduard Barbu presents a method for identifying false translations in translation memories, conceived as a classification task.
In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodorou explores the use of paraphrasing in retrieving better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.
CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.
Carla Parra Escartín aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators in Creation of new TM segments: Fulfilling translators' wishes. She presents a pilot study where the requests of translators are being implemented in the translation workflow.
Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper Improving Translation Memory Matching through Clause Splitting shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible without the support received from the EXPERT project (FP7/2007-2013 under REA grant agreement no. 317471, http://expert-itn.eu).
Constantin Orasan and Rohit Gupta
Organizers
Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK
Program Committee
Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk
Invited Speaker
Marcello Federico, Fondazione Bruno Kessler, Italy
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36
Conference Program
9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3º I, 28230 Las Rozas, Madrid, Spain
carla.parra@hermestrans.com
Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (AutoSuggest in SDL Trados Studio¹ and Muses in memoQ²), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

¹ www.sdl.com
² www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: Section 3 summarises the survey we carried out and the replies we got, and our company is briefly introduced in Section 2. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ⁴, it is not always carried out automatically and it is difficult to estimate when such reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

³ In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

⁴ memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs, where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if translators understand other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.⁵
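The wished-for multilingual lookup could be sketched as a TM keyed by source sentence with one entry per target language. The sentences below are invented placeholders, not from the company's TM:

```python
# Hypothetical multilingual TM: one source entry, several target languages.
multilingual_tm = {
    "Das Fenster wird geschlossen.": {"it": "La finestra viene chiusa."},
}

def lookup_all_languages(tm, source_sentence):
    """Return translations in every stored target language, so that a
    German > French project still surfaces an existing German > Italian
    translation unit as a hint for the translator."""
    return tm.get(source_sentence, {})

hints = lookup_all_languages(multilingual_tm, "Das Fenster wird geschlossen.")
```

A conventional TM keyed by language pair would return nothing here; keying by source sentence alone surfaces the Italian hint.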
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's such functionality: fragment assembly.⁶
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

⁵ For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

⁶ For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
Figure 1: memoQ's fragment assembly functionality
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
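memoQ's exact algorithm is not public; a greedy left-to-right sketch illustrates both the idea and the source-order limitation. The fragment translations below are illustrative EN > ES entries:

```python
def assemble(source, fragment_tm):
    """Greedy left-to-right fragment assembly: at each position, replace
    the longest known source fragment with its stored translation, keeping
    the source order (hence the reordering limitation discussed above)."""
    tokens = source.split()
    out, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):            # longest match first
            fragment = " ".join(tokens[i:j])
            if fragment in fragment_tm:
                out.append(fragment_tm[fragment])
                i = j
                break
        else:
            out.append(tokens[i])                      # no match: copy source word
            i += 1
    return " ".join(out)

# Illustrative fragment translations:
fragment_tm = {
    "The following message then appears:": "Aparece el siguiente mensaje:",
    "The application will close.": "Se cerrará la aplicación.",
}
result = assemble(
    "The following message then appears: The application will close.",
    fragment_tm)
```

For language pairs with similar word order this produces a usable draft; for reordering languages the concatenation order would be wrong.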
Examples 1 and 2 show two similar sentences in our TM.
(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
In technical texts, it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving these translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment was translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment is that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a b c d e f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a b c d e f
If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.
(7) Installing new services...
In these cases, we proceeded as follows:
1. Check that there is an ellipsis / a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part being the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
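The three steps above can be sketched minimally as follows; the marker handling is simplified (only the first marker found is used, and punctuation variants such as the precomposed ellipsis character are not covered):

```python
def split_segment(src, tgt, markers=("...", ":")):
    """Split a TM segment pair on a shared mid-segment ellipsis/colon,
    returning the two new (source, target) pairs, or [] if no split
    applies (marker absent on either side, or only at the very end)."""
    for m in markers:
        i, j = src.find(m), tgt.find(m)
        # The marker must appear mid-segment on both sides.
        if 0 < i < len(src) - len(m) and 0 < j < len(tgt) - len(m):
            return [(src[:i + len(m)].strip(), tgt[:j + len(m)].strip()),
                    (src[i + len(m):].strip(), tgt[j + len(m):].strip())]
    return []

pairs = split_segment(
    "Installing new services... Service XXXX for premium clients [2]",
    "Instalando nuevos servicios... Servicio XXXX para clientes premium [2]")
```

Each resulting pair is then stored as a new TM segment, so that a future input containing only "Installing new services..." still gets a match.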
4.2 Parentheses, square brackets and curly braces
Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters as well as the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

   (a) The sentence with those characters and the content between them removed.

   (b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

   (c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
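Step 2 can be sketched for a single sentence as follows (a hypothetical helper; the authors' script additionally checks that the target side contains matching content):

```python
import re

# Opening character -> closing character.
PAIRS = {"(": ")", "[": "]", "{": "}"}

def bracket_fragments(sentence):
    """Return the three fragments of step 2 for the first bracketed
    span found, or None if the sentence contains no matching pair."""
    for open_ch, close_ch in PAIRS.items():
        m = re.search(re.escape(open_ch) + r"(.*?)" + re.escape(close_ch), sentence)
        if m:
            # (a) the sentence without the bracketed span
            without = " ".join((sentence[:m.start()] + sentence[m.end():]).split())
            # (b) the span with the brackets kept
            with_brackets = m.group(0)
            # (c) the content alone, brackets removed
            content_only = m.group(1)
            return without, with_brackets, content_only
    return None
```

Applied to the English side of Example 8, it yields the sentence without the parenthetical, the parenthetical with its brackets, and the bare content.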
At this preliminary stage we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11-12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
     ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy
     ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
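The handling just described can be sketched as follows (an illustrative approximation; the tag pattern is an assumption, since the exact tag syntax handled by the authors' script is not given):

```python
import re

TAG = r"<[^>]+>"  # assumed tag shape, e.g. <1> ... </1>

def quote_tag_segments(sentence):
    """Keep (1) the sentence with quotes/tags removed and
    (2) each quoted or tagged span as a new segment of its own."""
    segments = []
    # (1) the same sentence without the quotation marks and tags
    plain = re.sub(TAG, "", sentence).replace('"', "")
    segments.append(" ".join(plain.split()))
    # (2) the text inside double quotes ...
    segments.extend(re.findall(r'"([^"]+)"', sentence))
    # ... and the text between a pair of tags
    segments.extend(re.findall(TAG + r"(.+?)" + TAG, sentence))
    return segments
```

For a tagged sentence like the one in Example 12, this produces the untagged sentence plus one segment per tagged span.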
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions   1064    80
100%          0       0
95-99%        4       2
85-94%        0       0
75-84%        285     14
50-74%        2523    187
No Match      2404    142
Total         6280    425

Table 1: Project statistics according to memoQ using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the sizes of the three TMs.
            Segments   EN words   ES words   Total words
Project TM  16842      212472     244159     456631
Product TM  20923      274542     317797     592339
Client TM   256099     3427861    3951732    7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.
Table 4 breaks down the number of segments generated using each strategy and for each TM used.

                          Segments   EN words   ES words   Total words
Project TM new segments   6776       56973      66297      123270
Product TM new segments   7760       71034      83125      154159
Client TM new segments    74041      662714     769705     1432419

Table 3: Size of the new TMs generated using our approach.

As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
                  Project TM   Product TM   Client TM
Ellipsis          7            6            50
Colon             1801         2094         17894
Parenthesis       1361         1621         29637
Square brackets   78           73           598
Curly braces      0            0            0
Quotations        1085         1146         8797
Tags              2523         2892         17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started   234  204  150  202  143  143  38   27   26   26    26
Pre-trans     155  111  178  108  179  183  139  199  205  201   205
Fragments     36   110  97   115  103  99   248  199  194  198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%          0    5    5    5    5    5    5    5    5    5     5
95-99%        2    3    3    4    4    4    9    9    9    9     9
85-94%        0    1    1    1    1    1    2    2    2    2     2
75-84%        14   9    14   9    15   15   12   16   16   16    16
50-74%        187  130  198  136  202  202  180  223  226  226   226
No match      142  197  124  190  118  118  137  90   87   87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase may be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together, and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory: perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations obtained with the first method is high and the errors are relatively few.

¹ https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or contain low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench². Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

² http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length³. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / sqrt(3.4 (ls + ld))    (1)
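Equation 1 is straightforward to compute; a small sketch follows (segment lengths here are taken in characters, which is an assumption, as the paper does not specify the unit):

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length difference (equation 1):
    the length difference between the source and destination
    segments, normalised by sqrt(3.4 * (ls + ld))."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A true translation pair tends to yield a score close to zero:
score = church_gale(len("How are you?"), len("Come stai?"))
```

Segments of equal length score exactly zero; a much longer source than target gives a large positive score, and the reverse a large negative one, which is why the tails of the score distribution are where false translations concentrate.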
In Figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat⁴. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In Figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in Figure 3. In Figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

³ This simple idea is implemented in many sentence aligners.
⁴ Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plots in Figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
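The tail-biased sampling of candidate negatives can be sketched as follows (a hypothetical helper; the paper does not state the exact sampling scheme beyond biasing away from the center of the score distribution):

```python
import random

def sample_tail_biased(bisegments, scores, n, tail_fraction=0.05):
    """Draw n bi-segments, preferring those whose Church-Gale score
    lies in the tails of the distribution (large absolute value),
    where false translations are more likely to be found."""
    ranked = sorted(zip(scores, bisegments), key=lambda p: abs(p[0]), reverse=True)
    tail_size = max(n, int(len(ranked) * tail_fraction))
    tail = [seg for _, seg in ranked[:tail_size]]
    return random.sample(tail, min(n, len(tail)))
```

The sampled candidates still need manual validation, as described above, since an extreme score does not guarantee a false translation.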
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if the source segment and the target segment are true translations.

• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists it is present/is not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only capital letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
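The lang dif computation reduces to counting disagreements between the declared and the detected language codes; a minimal sketch:

```python
def lang_dif(declared, detected):
    """Count how many of the two detected language codes disagree
    with the declared ones: 0 (en-en, it-it), 1 (en-en, it-fr),
    or 2 (en-de, it-fr), as in the examples above."""
    return sum(1 for decl, det in zip(declared, detected) if decl != det)
```

Each pair of codes is compared position-wise, so the feature takes values 0, 1, or 2 before normalization.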
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
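As an illustration of the cosine-similarity features, a sketch of punctuation similarity (the choice to score two punctuation-free segments as 0 rather than 1 is ours, not necessarily the authors'):

```python
import string
from collections import Counter
from math import sqrt

def punctuation_similarity(source, target):
    """Cosine similarity between the punctuation-count vectors
    of the source and target segments."""
    sv = Counter(ch for ch in source if ch in string.punctuation)
    tv = Counter(ch for ch in target if ch in string.punctuation)
    dot = sum(sv[ch] * tv[ch] for ch in set(sv) | set(tv))
    norm = sqrt(sum(v * v for v in sv.values())) * sqrt(sum(v * v for v in tv.values()))
    # Segments with no punctuation at all get 0 here (a design choice).
    return dot / norm if norm else 0.0
```

The tag, email, URL and number similarity features follow the same pattern, with the counted symbols swapped accordingly.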
The translation of the source English segment to Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
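The bisegment similarity feature can be sketched as a cosine over bag-of-words vectors; the machine translation step (the Bing API in our setup) is abstracted here into a precomputed string, and all names are illustrative:

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# mt_of_source stands in for the MT output of the source segment.
mt_of_source = "il gatto è sul tavolo"
target_segment = "il gatto è sul tavolo"
print(round(cosine_similarity(mt_of_source, target_segment), 2))  # → 1.0
```

A true translation pair yields a value near 1, while an unrelated target segment yields a value near 0, which is what the feature is meant to capture.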
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest: random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random forests have a lower probability of overfitting the data than decision trees.
• Logistic Regression: logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel: Support Vector Machines are among the most widely used classification algorithms.
• Gaussian Naive Bayes: if the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: this algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problemswhere the decision boundary is irregular
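The comparison above can be reproduced in outline with scikit-learn. This is only a sketch: the data here is synthetic (random feature vectors with a linearly separable label standing in for the real bi-segment features), so the scores are illustrative rather than those reported in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.rand(300, 12)                       # 12 normalized features per bi-segment
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic "true translation" label

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": LinearSVC(),
    "Gaussian NB": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}
cv = StratifiedKFold(n_splits=3)            # three-fold stratified evaluation
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:20s} F1 = {scores.mean():.2f}")
```

The two DummyClassifier entries correspond to the Baseline Uniform and Baseline Stratified baselines used in the evaluation of section 5.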
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm              Precision  Recall  F1
Random Forest             0.95     0.97   0.96
Decision Tree             0.98     0.97   0.97
SVM                       0.94     0.98   0.96
K-Nearest Neighbors       0.94     0.98   0.96
Logistic Regression       0.92     0.98   0.95
Gaussian Naive Bayes      0.86     0.96   0.91
Baseline Uniform          0.69     0.53   0.60
Baseline Stratified       0.70     0.73   0.71
Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all other algorithms achieve excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.
The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest             0.85     0.63   0.72
Decision Tree             0.82     0.69   0.75
SVM                       0.82     0.81   0.81
K-Nearest Neighbors       0.83     0.66   0.74
Logistic Regression       0.80     0.80   0.80
Gaussian Naive Bayes      0.76     0.61   0.68
Baseline Uniform          0.71     0.72   0.71
Baseline Stratified       0.70     0.51   0.59
Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion matrix for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
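The precision, recall and discard rate implied by these confusion counts can be checked directly:

```python
tp, fp, fn, tn = 175, 42, 32, 60      # SVM confusion counts on the test set
total = tp + fp + fn + tn             # 309 test examples
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(total, round(precision, 2), round(recall, 2))     # → 309 0.81 0.85
# The false negatives are good bi-segments that the cleaning would delete:
print(f"good bi-segments discarded: {fn / total:.0%}")  # → good bi-segments discarded: 10%
```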
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examplesdoes not offset the cost of deleting the false nega-tives from the database
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is higher than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
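The probability-ranking idea can be sketched as follows; the 0.9/0.1 bands are hypothetical cut-offs chosen for illustration, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(400, 12)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
clf = LogisticRegression().fit(X[:300], y[:300])

proba = clf.predict_proba(X[300:])[:, 1]        # P(bi-segment is a true translation)
confident_keep = proba >= 0.9                   # hypothetical high-precision band
confident_drop = proba <= 0.1
undecided = ~(confident_keep | confident_drop)  # leave these in the TM / send to review
print(confident_keep.sum(), confident_drop.sum(), undecided.sum())
```

Only the bi-segments in the two confident bands would be acted upon, trading recall for the higher precision the cleaning task requires.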
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
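A dictionary-based replacement for the MT module could look like the following sketch; the toy EN→IT dictionary is purely illustrative (in practice it would be extracted from word alignments):

```python
# Hypothetical EN->IT dictionary, e.g. built with a word aligner.
en_it = {"the": "il", "cat": "gatto", "is": "è", "on": "su", "table": "tavolo"}

def gloss_translate(segment: str) -> str:
    """Word-for-word gloss using the dictionary; unknown words pass through."""
    return " ".join(en_it.get(w, w) for w in segment.lower().split())

print(gloss_translate("the cat is on the table"))
# → il gatto è su il tavolo
```

The gloss is crude compared with real MT output, but for the cosine-based bisegment similarity feature, lexical overlap with the target segment is what matters, and dictionary lookup scales to the whole database.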
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.
Abstract
We propose the integration of clause
splitting as a pre-processing step for match
retrieval in Translation Memory (TM)
systems to increase the number of relevant
sub-segment matches Through a series of
experiments we investigate the impact of
clause splitting in instances where the input
does not match an entire segment in the
TM but only a clause from a segment Our
results show that there is a statistically
significant increase in the number of
retrieved matches when both the input
segments and the segments in the TM are
first processed with a clause splitter
1 Rationale
Translation memory tools have had a great
impact on the translation industry as they
provide considerable assistance to translators
They allow translators to easily re-use
previous translations providing them with
valuable productivity gains in an industry
where there is a great demand for quality
translation delivered in the shortest possible
time However existing tools have some
shortcomings The majority of existing tools
rely on Levenshtein distance and seek to
identify matches only at the sentence level
Semantically similar segments are therefore
difficult to retrieve if the string similarity is not
high enough as are sub-segment matches
because if only part of a sentence matches the
input even if this part is an entire clause it is
unlikely that this sentence would be retrieved
(Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text
types such as updated versions of technical
manuals
In this study we aim to address the
problem of retrieving sub-segment matches by
performing clause splitting on the source
segment as a pre-processing step for TM match
retrieval While matches for entire sentences
or almost entire sentences are the most useful
type of matches it is also less likely for such
matches to be found in most text types and
even less so for complex sentences Retrieving
clauses is desirable because there is a higher
chance for a match to be found for a clause than
for a complex sentence and at the same time
clauses are similar to sentences in that they
both contain a subject and a verb hence a
"complete thought"; therefore, clause matches
are more likely to be in context and to actually
be used by the translator than phrase matches
for example
We perform experiments comparing the
match retrieval performance of TM tools when
they are used as is and when the input file and
the TM segments are first processed with a
clause splitter before being fed into the TM
tool The paper is organised as follows In
section 2 we discuss related work on TM
matching In section 3 we discuss how clause
splitting can be beneficial to TM matching and
how clause splitting was implemented for this
study We then describe our experiments in
section 4 and discuss the results in section 5
Finally we present our conclusion and future
work in section 6
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of
existing tools include the integration of
language processing to break down a sentence
into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching.
However Reinke (2013) observes that for
certain language pairs like English-German
only rather short phrases like simple NPs are
identified and larger syntactic units cannot be
retrieved. This can be regarded as a disadvantage,
as the processing of larger units would be
desirable for the support of professional
computer-assisted human translation (Kriele
(2006) and Macken (2009) cited in Reinke
2013)
MetaMorphoTM (Hodász and Pohl, 2005)
also divides sentences into smaller chunks Moreover it uses a multi-level linguistic
similarity technique (surface form lemma
word class) to determine similarity between
two source-language segments
Other attempts involve deeper linguistic
processing techniques In Pekar and Mitkov
(2007) we propose the lsquothird-generation
translation memoryrsquo which introduces the
concept of semantic matching We employ
syntactic and semantic analysis of segments
stored in a TM to produce a generalised
representation of segments which reduces
equivalent lexical syntactic and lexico-
syntactic constructions into a single
representation Then a retrieval mechanism
operating on these generalised
representations is used to search for useful
previous translations in the TM
This study is part of the third-generation
translation memory project of which the
ultimate goal is to produce more intelligent TM
systems using NLP techniques To the best of
our knowledge clause splitting has not
previously been investigated as a possible
method for increasing the number of relevant
retrieved matches
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that
when a sufficiently close match cannot be
found for a new input sentence current TM
systems are unable to retrieve sentences that
contain the same clauses or other major
phrases
For example (a) below is a new input
sentence composed of twenty five-character
words The TM contains the sentence (b)
which shares an identical substring with
sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current
TM tools would be able to retrieve it as a fuzzy
match
(a) w1 w2 w3 w4 w5 w6 w20
(b) w1 w2 w3 w4 w5 w21 w35
If clause splitting were employed clauses
would be treated as separate segments thus
increasing the likelihood that clauses which
are subparts of larger units could have a match
score sufficiently high to be retrieved by the
TM system
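A rough sketch with Python's difflib (its similarity ratio serves here as a stand-in for the Levenshtein-based scores used by commercial TM tools) illustrates why the shared clause is retrievable on its own but sentence (b) is not:

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Word-level similarity ratio, a stand-in for a Levenshtein TM score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

input_seg = " ".join(f"w{i}" for i in range(1, 21))        # w1 ... w20
tm_seg = " ".join(f"w{i}" for i in range(1, 6)) + " " + \
         " ".join(f"w{i}" for i in range(21, 36))          # w1..w5 w21..w35
shared_clause = " ".join(f"w{i}" for i in range(1, 6))     # w1 ... w5

print(round(match_score(input_seg, tm_seg), 2))            # → 0.25
print(round(match_score(shared_clause, shared_clause), 2)) # → 1.0
```

Whole-sentence matching scores 25%, below any usable fuzzy threshold, while the clause-level comparison is an exact match.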
In this paper we compare the effect of
performing clause splitting before retrieving
matches in a TM In each experiment we
identify the difference in matching
performance when a TM tool is used as is and
when the input file and the translation memory
are first run through a clause splitter The
clause splitter we use in this study is a modified version of the one described in Puscasu (2004)
The original version employs both machine
learning and linguistic rules to identify finite
clauses for both English and Romanian but in
this version only the rule-based module is
used Puscasu (2004) developed a clause
splitting method for both English and
Romanian and to maintain consistency
between the two languages her definition of a
clause is the one prescribed by the Romanian
Academy of Grammar which is that a clause is
a group of words containing a finite verb.
Nonfinite and verbless clauses are therefore
not considered The reported F-measure for
identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study the clause splitter is used on
both the segments in the input file and the translation memory database After processing
the input and the TM segments with the clause
splitter these were then imported into existing
TM tools to examine how well these tools will
perform if clause splitting is used in pre-
processing
4 Experiments
Experiments were performed to study the
impact of clause splitting when used in pre-
processing for the retrieval of segments in a
TM Our hypothesis is that when a clause
splitter is used the number of relevant
retrieved matches will increase
The effect of clause splitting is examined by
comparing the number of matches retrieved
when TM tools are used as is and when both
the input segments and the segments in the
translation memory are first processed with a
clause splitter before being imported into the
TM tools The tools used are Wordfast
Professional 3 and Trados Studio 2009 which
are among the most widely used TM tools
(Lagoudaki 2006)
Segments used as the input were selected
from the Edinburgh paraphrase corpus (Cohn
Callison-Burch and Lapata 2008) (in Macken
2009) We use a paraphrase corpus because we
wish to investigate the effect of using a clause
splitter in pre-processing to retrieve both
segments that contain the entire input clause
and segments that do not contain the exact
input clause but may still be relevant as they
contain a clause that shares a considerable
degree of similarity with the input
We examine the segments retrieved using
both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally
recommended for translators to set a low fuzzy
match threshold as this might result in the
retrieval of too many irrelevant segments if the
translation memory is large However in this
study we argue it would be beneficial to
examine matches retrieved with the minimum
threshold as well Given that translation
memory match scores are mainly calculated
using Levenshtein distance if only one clause
in a segment in the TM matches the input there
is a greater chance of the segment being
retrieved with a lower threshold We therefore
wish to examine whether the employment of
clause splitting will still result in a considerable
improvement from using the Levenshtein
distance-based matching algorithm in most TM
tools if the match threshold setting is already
optimised for the retrieval of sub-segment
clauses
It must also be noted that for this study we
are working with the source segments only
Therefore in the TM files used both the source
and target segments are in English
We conducted two main sets of experiments
referred to as Set A and Set B which are
outlined below In Set A we selected sentences
from the Edinburgh corpus that contained
more than one clause We use one clause or
part of it as the input segment and we store
the entire sentence in the TM In the
experiments where no clause splitting is done
the sentence is stored as is In the experiments
with clause splitting the original input
segments are split into clauses (if there are more than one) and the segments in the TM are
the component clauses of the original sentence
For the experiments done without clause
splitting there are 150 input segments and for
each one we test whether the longer
corresponding segment in the TM can be
retrieved For the experiments where clause
splitting is used the 150 input segments are
split into 180 segments as some of these
segments contain more than one clause We
then test whether the corresponding clause
from the original longer sentence can be
retrieved An example is presented in Table 1
The underlined segments are the
corresponding segments that should be
retrieved
In Set B there are also 150 input segments
for the experiments where no clause splitting
is used and the corresponding segment in the
TM is a longer sentence containing a paraphrase of the input segment For the
experiments with clause splitting there are
185 input segments (as in set A some of the
original 150 have more than one clause) and in
the TM the component clauses of the original
longer segment are stored and we test
whether the clause that is a paraphrase of the
input can be retrieved Below is an example
Without clause splitting
Input Segment in TM
the ministry of defense
once indicated
that about 20000
soldiers were missing in
the korean war
the ministry of defense
once indicated that about
20000 soldiers were
missing in the korean war
and that the ministry of
defense believes there
may still be some
survivors
With clause splitting
Input Segments in TM
- the ministry of defense
once indicated
- that about 20000
soldiers were missing in
the korean war
- the ministry of defense
once indicated
- that about 20000
soldiers were missing in
the korean war
- and that the ministry of
defense believes
- there may still be some
survivors
Table 1: Set A Example

Without clause splitting
Input Segment in TM
a member of the chart-
topping collective so
solid crew dumped a
loaded pistol in an
alleyway
a member of the rap
group so solid crew threw
away a loaded gun during
a police chase southwark
crown court was told
yesterday
With clause splitting
Input Segments in TM
a member of the chart-
topping collective so
solid crew dumped a
loaded pistol in an
alleyway
- a member of the rap
group so solid crew threw
away a loaded gun during
a police chase
- southwark crown court
was told yesterday
Table 2: Set B Example
5 Results
WORDFAST
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                90.00%
Retrieved (minimum threshold)          38.00%                92.22%

TRADOS
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                88.89%
Retrieved (minimum threshold)          14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in
set A It is clear that clause splitting
considerably increases the number of matches
in instances where the input segment can be
found in a longer segment stored in the TM
When the corresponding segments that could
not be retrieved even with the minimum
threshold were analysed we found that in set
A all instances were due to errors in clause
splitting more specifically the fact that the
clause splitter failed to split a sentence
containing more than one clause
Table 4 summarises the percentage of
correctly retrieved segments in set B In this
set it was observed that although the
percentage of retrieved matches is generally
lower than the percentages in set A there is
still a noticeable increase in the percentage of
matches retrieved
Table 4: Percentage of correctly retrieved segments in Set B.
Upon examination of the segments that
could not be retrieved even with the default
threshold we found that in both Wordfast and
Trados, around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when
the segment is incorrectly split As for the rest
of the unretrieved segments we presume that
they are so heavily paraphrased that even
when clause splitting is performed correctly
the TM tools are still unable to retrieve them
For each experiment we conduct a paired t-test using the match scores produced by the
TM tools when retrieving each segment (Table
5) When there are no matches or the correct
match is not retrieved the match score is 0 In
instances where the original input segment has
more than one clause and is thus split by the
clause splitter we take the average match
score of the clauses and take this as one case in
order to make the results comparable In all
experiments the difference is significant at the
0.0001 level when computed with SPSS. We
can therefore reject the null hypothesis and
conclude that there is a statistically significant
difference between the results
SET A

WORDFAST
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                      85.31              0.000
Minimum threshold            29.28                      86.49              0.000

TRADOS
Default threshold            11.78                      83.88              0.000
Minimum threshold            11.78                      88.07              0.000

SET B

WORDFAST
Default threshold             2.23                      17.21              0.000
Minimum threshold             6.49                      30.62              0.000

TRADOS
Default threshold             2.71                      23.08              0.000
Minimum threshold            16.83                      46.37              0.000

Table 5: Paired t-test on all experiments.
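The paired t statistic behind these tests can be computed by hand; the per-segment match scores below are hypothetical, chosen only to show the shape of the computation (a score of 0 means no correct match was retrieved):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: mean of the differences over its standard error."""
    d = [a - b for a, b in zip(after, before)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

without_split = [0, 0, 75, 0, 82, 0, 0, 78, 0, 0]     # hypothetical match scores
with_split    = [91, 88, 96, 85, 100, 79, 93, 97, 84, 90]
t = paired_t(without_split, with_split)
print(round(t, 2))   # well above the ~4.78 two-tailed critical value for p < 0.001 at df = 9
```

Averaging the clause scores of a split segment into one case, as described above, keeps `before` and `after` the same length, which the paired test requires.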
6 Conclusion
Our results show that introducing clause
splitting as a pre-processing step in TM match
retrieval can significantly increase matching
WORDFAST
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                17.84%
Retrieved (minimum threshold)          10.00%                41.08%

TRADOS
Retrieved (default threshold)           3.33%                25.95%
Retrieved (minimum threshold)          36.00%                70.67%
performance in instances where the TM
contains segments of which one of the clauses
corresponds to the input segment or is a
paraphrase of the input segment
It is worth mentioning that the data used in
these experiments are not data imported from
the translation memories of practicing translators, as they are not easily available. We
nevertheless believe that the results of this
study provide significant support to the proof-
of-concept of third-generation TM systems
where NLP processing is expected to improve
performance of operational TM systems
In future work we wish to incorporate
alignment so that on the target side what is
retrieved is not the original target segment but
the corresponding clause as in its current
state our method would only be able to
retrieve the original target segment given that
we perform clause splitting only on the source
side It would also be desirable to implement a
working TM tool that incorporates clause
splitting and examine to what extent these help
a translator working on an actual translation
project as the final test of the usefulness of the
methods employed is how they actually
increase the productivity of translators in
terms of time saved
Acknowledgements
We would like to acknowledge the members of
the Research Group in Computational
Linguistics at University of Wolverhampton for
their support especially Dr Georgiana Marsic Dr Constantin Orasan and Dr Sanja Štajner
Part of the research reported in this paper is
supported by the People Programme (Marie
Curie Actions) of the European Union's
Framework Programme (FP72007-2013)
under REA grant agreement no 317471
References
Cohn T Callison-Burch C and Lapata M
(2008) Constructing Corpora for the
Development and Evaluation of paraphrase
systems Computational Linguistics 34(4)
597-614
Hodász, G. and Pohl, G. (2005) MetaMorpho
TM A Linguistically Enriched Translation
Memory In Proceedings of the International
Conference on Recent Advances in Natural
Language Processing (RANLP-05)
Borovets Bulgaria
Kriele C (2006) Vergleich der Beiden
Translation-Memory-Systeme TRADOS
und SIMILIS. Diploma thesis, Saarbrücken:
Saarland University [unpublished]
Lagoudaki E (2006) Translation Memories
Survey 2006: Users' Perceptions around
TM Use In proceedings of the ASLIB
International Conference Translating amp the
Computer (Vol 28 No 1 pp 1-29)
Macken L (2009) In Search of the
Recurrent Units of Translation In
Evaluation of Translation Technology ed
by Daelemans, Walter and Véronique
Hoste 195-212 Brussels Academic and Scientific Publishers
Macklovitch E and Russell G (2000)
What's been Forgotten in Translation
Memory In Envisioning machine
translation in the information future (pp
137-146) Springer Berlin Heidelberg
Marsic G (2011) Temporal Processing of
News Annotation of Temporal Expressions
Verbal Events and Temporal Relations [PhD
thesis University of Wolverhampton]
Pekar V and Mitkov R (2007) New
Generation Translation Memory Content-
Sensitive Matching In Proceedings of the
40th Anniversary Congress of the Swiss
Association of Translators Terminologists
and Interpreters 29-30 September 2006
Bern
Planas E (2005) SIMILIS - Second
generation TM software In Proceedings of
the 27th International Conference
on Translating and the Computer (TC27)
London UK
Puscasu G (2004) A Multilingual Method
for Clause Splitting In Proceedings of the
7th Annual Colloquium for the UK Special
Interest Group for Computational
Linguistics
Reinke U (2013) State of the Art in
Translation Memory Technology
Translation Computation Corpora
Cognition 3(1)
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT)
tools have become the major language
technology to support and facilitate the
translation process Those kind of pro-
grams store previously translated source
texts and their equivalent target texts in a
database and retrieve related segments
during the translation of new texts How-
ever most of them are based on string or
word edit distance, not allowing the retrieval of matches that are similar in meaning. In this paper
we present an innovative approach to
match sentences having different words
but the same meaning We use NooJ to
create paraphrases of Support Verb Con-
structions (SVC) of all source translation
units to expand the fuzzy matching capa-
bilities when searching in the translation
memory (TM) Our first results for the
EN-IT language pair show consistent and
significant improvements in matching
over state-of-the-art CAT systems across
different text domains
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist
the translation process The idea behind the com-
puter-assisted software is that a translator should
benefit as much as possible from reusing transla-
tions that have been human translated in the past
The first thoughts can be traced back to the 1960s
when the European Coal and Steel Community
proposed the development of a memory system
that retrieves terms and their equivalent contexts
from earlier translations stored in its memory by
the sentences whose lexical items are close to the
lexical items of the sentence to be translated (Kay
1980)
Since then TM systems have become indispen-
sable tools for professional translators who work
mostly with content that is highly repetitive such
as technical documentation games and software
localization etc TM systems typically exploit not
only exact matches between segments from the
document to be translated with segments from
previous translations but also approximate
matches (often referred to as fuzzy matches)
(Biçici and Dymetman, 2008). As a concept, this
technique might be more useful for a translator be-
cause all the previous human translations become
a starting point of the new translation Further-
more the whole process is speeded up and the
translation quality is more consistent and effi-
cient
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process.
This effort is typically less than translating the
sentence from scratch To help the translator
CAT tools suggest or highlight all the differences
or similarities between the sentences, also penalizing the match percentage in some cases. However,
given the perplexity of a natural language for
similar but not identical sentences the fuzzy
matching level sometimes is too low and therefore
the translator is confused
This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows. Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output; they do not improve the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much of the human translations for a given sentence as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on the paraphrases in a database. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques: the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match is identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, based on Koehn's (2010) formula for fuzzy matching, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace). These are low scores, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
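As an illustration, the word-based variant of this score can be sketched as follows (a minimal implementation assuming whitespace tokenization; exact figures depend on how tokens are counted, so the value printed below need not reproduce the word counts reported above):

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    # Koehn-style fuzzy match: 1 - edit distance / length of the longer segment
    t1, t2 = s1.split(), s2.split()
    return 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2), 2))  # 0.6 under this tokenization
```

With this simple whitespace tokenization, sentence (1) has 10 tokens and the edit distance is 4 (1 substitution, 3 deletions), giving a 60% match; different tokenization and normalization choices explain the slightly different percentages quoted in the text.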
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) uses the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. Such complexes can be paraphrased with a full verb while maintaining the same meaning. Although support verbs are similar to auxiliary verbs in terms of their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptional cases, they appear either without a determiner (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is also used for de-tokenization in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline performs the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
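The three pipelines described above can be sketched as follows. This is hypothetical code: the real framework drives the Berkeley Tokenizer and NooJ local grammars, which are stood in for here by a toy single-entry SVC lexicon, and `paraphrase_svc` and `expand_tm` are names introduced for illustration only:

```python
def paraphrase_svc(sentence):
    """Stand-in for the NooJ step: rewrite a known SVC as a full verb.
    Returns None when no SVC is found, so no new unit is generated."""
    svc_table = {"make the cancellation of": "cancel"}  # toy lexicon
    for svc, verb in svc_table.items():
        if svc in sentence:
            return sentence.replace(svc, verb)
    return None

def expand_tm(tm):
    """tm: list of (source, target) pairs. Target units are never touched;
    paraphrased source units are appended with the original target."""
    expanded = list(tm)
    for src, tgt in tm:                      # pipeline 1: extract source units
        para = paraphrase_svc(src)           # pipeline 2: paraphrase SVCs
        if para is not None:                 # pipeline 3: concatenate back
            expanded.append((para, tgt))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(len(expand_tm(tm)))  # 2: the original unit plus its paraphrased copy
```

The key property of the design survives the simplification: the Italian target string is copied verbatim into the new unit, so every retrieved suggestion remains a human translation.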
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain, to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, with a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.
The results of both analyses are given in Table 1.
Our paraphrased TM markedly increases the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), though we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and, finally, 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought.
Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in fuzzy match levels, our system presents a few weaknesses, most of which concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This weakness of our approach will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items differ.
The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs, as well as support for other languages. Last but not least, applying the paraphrase framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Xerox Palo Alto Research Center, Palo Alto, CA, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources, of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases or, in the best case, full segments.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated as the first word in the translation, be, and the second word větší as the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
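The index-based joining test can be sketched as follows (hypothetical code: spans are (start, end) token positions, and the single greedy longest-first pass mirrors the ordering described above, without the temporary-list bookkeeping of the full algorithm):

```python
def can_join(a, b):
    # two spans can be joined if they overlap (JOINO) or are
    # direct neighbours in the segment (JOINN)
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1

def join(a, b):
    # merge two joinable spans into one covering span
    (s1, e1), (s2, e2) = sorted([a, b])
    return (s1, max(e1, e2))

def join_pass(spans):
    """One greedy pass: start from the longest span, try to join the others."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    best = spans[0]
    for other in spans[1:]:
        if can_join(best, other):
            best = join(best, other)
    return best

# spans over a 6-token segment: (0,2) and (3,5) neighbour, (2,4) overlaps both
print(join_pass([(0, 2), (3, 5), (2, 4)]))  # (0, 5): the whole segment is covered
```

When a pass covers the whole segment, the corresponding target-side concatenation yields a full pre-translated segment, which is the best case mentioned earlier.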
2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods; see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For a comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). With our data we have achieved a score of 0.29, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations into the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjacent segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric; it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviour on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0]

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
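The effect of down-weighting deletions can be sketched with a weighted edit distance; the weights below are illustrative, not the values actually used in the system:

```python
def weighted_edit_cost(tm_seg, input_seg, ins=1.0, dele=0.1, sub=1.0):
    """Edit cost of turning the TM segment into the input segment, with a
    reduced cost for deletions (deleting an already-translated word is cheap
    for the post-editor). Weights are illustrative assumptions."""
    t, r = tm_seg.split(), input_seg.split()
    dp = [[0.0] * (len(r) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        dp[i][0] = dp[i-1][0] + dele         # drop a TM word
    for j in range(1, len(r) + 1):
        dp[0][j] = dp[0][j-1] + ins          # add an input word
    for i in range(1, len(t) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (0.0 if t[i-1] == r[j-1] else sub),
                           dp[i-1][j] + dele,
                           dp[i][j-1] + ins)
    return dp[len(t)][len(r)]

test = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
# with uniform costs TM segment 2 wins; with cheap deletions TM segment 1 wins
assert weighted_edit_cost(tm1, test, dele=1.0) > weighted_edit_cost(tm2, test, dele=1.0)
assert weighted_edit_cost(tm1, test) < weighted_edit_cost(tm2, test)
```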
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between the selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
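One way to realise this projection is sketched below. It assumes the TER alignment has already yielded the set of source-token positions marked C, and that the GIZA++ alignment is available as 0-indexed (source, target) token pairs; the function and data layout are our own simplification, not the tool's actual code. The example indices correspond to the English-Bengali pair shown above (punctuation omitted for brevity).

```python
def color_target(n_tgt, matched_src_positions, alignment):
    """Project source-side TER matches onto the target side: a target token is
    colored green iff it is word-aligned to a TM source token that matched the
    input sentence; all other target tokens stay red."""
    colors = ['red'] * n_tgt
    for src_i, tgt_j in alignment:
        if src_i in matched_src_positions:
            colors[tgt_j] = 'green'
    return colors

# TM source "we want to have a table near the window" vs the input sentence:
# TER marks {we, a, table, the, window} = positions {0, 4, 5, 7, 8} as matched.
# The (source, target) pairs are the 0-indexed version of the GIZA++ output
# shown above, for the six Bengali content tokens.
colors = color_target(6, {0, 4, 5, 7, 8},
                      [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1)])
assert colors == ['green', 'green', 'red', 'green', 'green', 'red']
```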
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., the source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
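A minimal sketch of this filtering, with an illustrative stopword list (not the one used in the tool):

```python
from collections import defaultdict

STOPWORDS = {"a", "the", "to", "of", "in", "we", "you", "it"}   # illustrative

def build_index(tm_sources):
    """Inverted index: lowercased surface form -> ids of TM segments
    containing it (no stemming, stopwords dropped)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """TM segments sharing at least one non-stopword surface form with the
    input; only these are passed on to the TER similarity computation."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
idx = build_index(tm)
# only the first TM segment shares content words with the input
assert candidates(idx, "we would like a table by the window") == {0}
```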
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file, which the post-editors can work with directly later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.
Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.
Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.
Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.
Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.
Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.
Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.
Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.
Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.
Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.
Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.
Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
without the support received from the EXPERT project (FP7/2007-2013, under REA grant agreement no. 317471, http://expert-itn.eu).
Constantin Orasan and Rohit Gupta
Organizers
Constantin Orasan University of Wolverhampton UK
Rohit Gupta University of Wolverhampton UK
Program Committee
Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk
Invited Speaker
Marcello Federico Fondazione Bruno Kessler Italy
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36
Conference Program
9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015
Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3.º I, 28230 Las Rozas, Madrid, Spain
carla.parra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite their positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (the UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to provide our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate
3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an underuse of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5
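The cross-language lookup wished for here can be sketched as a toy in-memory index that, unlike a bilingual TM, returns matches in every stored target language. The class and method names below are illustrative, not any CAT tool's actual API:

```python
# Hedged sketch: a toy multilingual TM index keyed on the source side only,
# so a lookup returns translations in *all* stored target languages.
from collections import defaultdict


class MultilingualTM:
    """Toy TM that does not filter matches by the project's target language."""

    def __init__(self):
        self._index = defaultdict(dict)

    def add(self, src_lang, src_text, tgt_lang, tgt_text):
        # Store one translation unit; later additions for other target
        # languages accumulate under the same source entry.
        self._index[(src_lang, src_text)][tgt_lang] = tgt_text

    def lookup(self, src_lang, src_text):
        # Return every stored translation for this source segment.
        return dict(self._index.get((src_lang, src_text), {}))


tm = MultilingualTM()
tm.add("de", "Datei speichern", "it", "Salva file")
tm.add("de", "Datei speichern", "fr", "Enregistrer le fichier")

# A German > French project would still see the Italian translation as a hint.
print(tm.lookup("de", "Datei speichern"))
# {'it': 'Salva file', 'fr': 'Enregistrer le fichier'}
```

A production TM would of course add fuzzy matching and metadata, but the key design point is the same: the index is keyed on the source segment alone, not on a fixed language pair.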
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on the corresponding memoQ functionality: fragment assembly.6
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5 For SDL Studio, there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6 For further information, see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality.
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequency of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples (1) and (2) show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in (3):

(3) EN: The following message then appears: The application will close.

Taking the part of (1) before the colon and the part of (2) after the colon, we would be able to produce the right translation, as shown in (4):

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parenthesis
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3), we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in (5) and (6) could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example (7), the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears:

(7) Installing new services
In these cases we proceeded as follows:

1. Check that there is an ellipsis/a colon on both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
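The three steps above can be sketched as follows. The marker set and the both-sides check are our reading of the procedure described here, not the authors' actual script:

```python
import re

# Hedged sketch of the ellipsis/colon splitting step. Markers: "...",
# the Unicode ellipsis, or a colon; assumed inventory, not the real tool's.
MARKERS = re.compile(r'(\.\.\.|\u2026|:)')


def split_segment(source, target):
    """Return new (source, target) fragment pairs, or [] if not splittable."""
    src_parts = MARKERS.split(source, maxsplit=1)
    tgt_parts = MARKERS.split(target, maxsplit=1)
    # Step 1: the marker must occur on BOTH the source and the target side.
    if len(src_parts) != 3 or len(tgt_parts) != 3:
        return []
    src_before, _, src_after = src_parts
    tgt_before, _, tgt_after = tgt_parts
    # Steps 2-3: one new TM segment per aligned fragment.
    return [(src_before.strip(), tgt_before.strip()),
            (src_after.strip(), tgt_after.strip())]


pairs = split_segment(
    "Installing new services... Service XXXX for premium clients",
    "Instalando nuevos servicios... Servicio XXXX para clientes premium")
print(pairs)
# [('Installing new services', 'Instalando nuevos servicios'),
#  ('Service XXXX for premium clients', 'Servicio XXXX para clientes premium')]
```

Note that pairing the pre-marker fragments with each other (and likewise the post-marker ones) is itself an alignment assumption; it holds for the English-Spanish examples above but could fail for language pairs that reorder around the marker.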
4.2 Parenthesis, square brackets and curly braces

Sometimes a sentence includes a fragment between parenthesis, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parenthesis. When sentences like the ones in Examples (8)–(10) appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content:

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parenthesis, square brackets or curly braces on both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence, removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment, the parenthesis, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parenthesis, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
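A minimal single-sentence sketch of the three fragments kept in step 2; the real procedure additionally requires the bracketed content to be present on both the source and target sides and pairs the resulting fragments, which is omitted here for brevity:

```python
import re

# Hedged sketch: the three fragments (a)-(c) described above, for one
# sentence. The regexes are illustrative, not the authors' implementation,
# and only handle non-nested brackets.
BRACKET_PAIRS = [(r'\(', r'\)'), (r'\[', r'\]'), (r'\{', r'\}')]


def bracket_fragments(text):
    fragments = []
    for opener, closer in BRACKET_PAIRS:
        pattern = re.compile(opener + r'([^()\[\]{}]*)' + closer)
        match = pattern.search(text)
        if not match:
            continue
        # (a) the sentence without the brackets and their content
        fragments.append(pattern.sub('', text).replace('  ', ' ').strip())
        # (b) the bracketed fragment, brackets kept
        fragments.append(match.group(0))
        # (c) the content alone, brackets dropped
        fragments.append(match.group(1))
    return fragments


print(bracket_fragments(
    "Creates an installation package (if it was not created earlier)"))
# ['Creates an installation package',
#  '(if it was not created earlier)',
#  'if it was not created earlier']
```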
At this preliminary stage, we considered that when a sentence has several clauses in parenthesis, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work, we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples (11)–(12) illustrate this kind of segments:
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
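This handling can be sketched as below, assuming <n>…</n> tag pairs as in Example (12) and plain double quotes for quotations; the exact tag and quote inventory of the authors' script is not given, so both patterns are assumptions:

```python
import re

# Hedged sketch of the quotation/tag handling: keep the sentence with the
# markup stripped, plus each quoted or tagged span as a segment of its own.
TAG = re.compile(r'<(\d+)>(.*?)</\1>')    # paired numeric tags, assumed
QUOTE = re.compile(r'"([^"]+)"')          # plain double quotes, assumed


def quote_and_tag_segments(text):
    segments = []
    # The sentence itself, with tags and quote marks removed but the
    # enclosed text kept (removing the text would add too much noise).
    stripped = TAG.sub(r'\2', text)
    stripped = QUOTE.sub(r'\1', stripped)
    if stripped != text:
        segments.append(stripped)
    # Each tagged or quoted span becomes a new segment on its own.
    segments.extend(m.group(2) for m in TAG.finditer(text))
    segments.extend(m.group(1) for m in QUOTE.finditer(text))
    return segments


print(quote_and_tag_segments('Clear the <1>Inherit settings</1> check box'))
# ['Clear the Inherit settings check box', 'Inherit settings']
```

As with the bracket case, the real procedure would run this on source and target in parallel and keep only the pairs found on both sides.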
5 Pilot test
Before integrating our system into a CAT tool and into our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of
our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions   1064    80
100%          0       0
95–99%        4       2
85–94%        0       0
75–84%        285     14
50–74%        2523    187
No Match      2404    142
Total         6280    425

Table 1: Project statistics according to memoQ using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words EN   Words ES   Total
Project TM   16842      212472     244159     456631
Product TM   20923      274542     317797     592339
Client TM    256099     3427861    3951732    7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM
                          Segments   Words EN   Words ES   Total
Project TM new segments   6776       56973      66297      123270
Product TM new segments   7760       71034      83125      154159
Client TM new segments    74041      662714     769705     1432419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parenthesis was more prolific.
Type              Project TM   Product TM   Client TM
Ellipsis          7            6            50
Colon             1801         2094         17894
Parenthesis       1361         1621         29637
Square brackets   78           73           598
Curly braces      0            0            0
Quotations        1085         1146         8797
Tags              2523         2892         17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

              TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started   234   204   150   202   143   143   38    27    26    26     26
Pre-trans     155   111   178   108   179   183   139   199   205   201    205
Fragments     36    110   97    115   103   99    248   199   194   198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%       0     5     5     5     5     5     5     5     5     5      5
95–99%     2     3     3     4     4     4     9     9     9     9      9
85–94%     0     1     1     1     1     1     2     2     2     2      2
75–84%     14    9     14    9     15    15    12    16    16    16     16
50–74%     187   130   198   136   202   202   180   223   226   226    226
No match   142   197   124   190   118   118   137   90    87    87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 187 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work

In this paper, we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, Sept 2015
Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations using the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan the further developments.
2 Related Work
The translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors come from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the length of the source and destination segments we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s − l_d) / sqrt(3.4 (l_s + l_d))    (1)
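Equation 1 is straightforward to compute; below is a minimal sketch in Python, using character lengths of the two segments as in the paper (the function name and example segments are ours):

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference, equation (1):
    (l_s - l_d) / sqrt(3.4 * (l_s + l_d)), on character lengths."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# True translations of similar length score near 0; a partial
# translation (much shorter target) yields a larger score.
balanced = church_gale_score("How are you?", "Come stai?")
skewed = church_gale_score("Early 1980s Muirfield CC", "Primi anni 1980")
```

Segments of equal length score exactly 0, and the square-root denominator dampens the raw length difference for longer segments.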
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set. It contains 1243 bi-segments and has 373 negative examples.
• Test Set. It contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
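The tail-biased sampling described above can be sketched as follows. This is an illustrative reconstruction, not the authors' script; in particular, the acceptance probability as a function of |CG| is our own assumption:

```python
import math
import random

def biased_negative_sample(bisegments, n, rng=None):
    """Draw n candidate negatives from a list of (source, target)
    pairs, favouring pairs whose Church-Gale score (equation 1)
    lies far from the center of the distribution."""
    rng = rng or random.Random(0)
    chosen = []
    while len(chosen) < n:
        src, tgt = rng.choice(bisegments)
        # Equation (1) on character lengths.
        cg = (len(src) - len(tgt)) / math.sqrt(3.4 * (len(src) + len(tgt)))
        # Keep tail scores almost surely, central scores only rarely.
        if rng.random() < min(1.0, 0.05 + abs(cg) / 3.0):
            chosen.append((src, tgt))
    return chosen
```

The sampled candidates would still be validated manually, as in the paper; the bias only raises the chance that a drawn pair is a genuine error.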
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when whole words in capital letters exist.
• punctuation similarity. The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only capital letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
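As an illustration, a minimal sketch of how one of the vector-similarity features above can be computed; the punctuation inventory and the handling of zero vectors are our own modelling assumptions, not details from the paper:

```python
import math
import string

def punctuation_vector(segment: str) -> list:
    """Count each punctuation character, giving a fixed-length vector."""
    return [segment.count(ch) for ch in string.punctuation]

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        # No punctuation on one side: similar only if both are empty
        # (an assumption; the paper does not specify this case).
        return 1.0 if nu == nv else 0.0
    return dot / (nu * nv)

def punctuation_similarity(source: str, target: str) -> float:
    return cosine(punctuation_vector(source), punctuation_vector(target))
```

For example, "How are you?" and "Come stai?" share the same punctuation vector (a single question mark), so the feature value is 1.0.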
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than Decision Trees.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified cross-validation on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
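This evaluation protocol can be reproduced with scikit-learn, which also provides both baselines directly (DummyClassifier with the uniform and stratified strategies). The synthetic features below merely stand in for the real feature vectors, so the absolute scores are not those of the paper:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 1243 training bi-segments:
# ~30% negatives, one weakly informative feature, one noise feature.
rng = np.random.RandomState(0)
n = 1243
y = (rng.rand(n) < 0.7).astype(int)            # 1 = true translation
X = np.c_[y + 0.5 * rng.randn(n), rng.rand(n)]

cv = StratifiedKFold(n_splits=3)
results = {}
for name, clf in [
    ("SVM", LinearSVC()),
    ("Baseline Uniform", DummyClassifier(strategy="uniform", random_state=0)),
    ("Baseline Stratified", DummyClassifier(strategy="stratified", random_state=0)),
]:
    results[name] = cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
```

Stratified folds keep the ~30% negative proportion in every fold, which matters with imbalanced classes.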
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.
The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. From the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased even if the recall drops. We make some suggestions in the next section.
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that differently behaving classifiers can compensate for each other's errors. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
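The effect can be made concrete with a small sketch of word-level Levenshtein fuzzy matching (our own illustration, not the authors' code; commercial TM tools do not disclose their exact scoring formulas). Against the whole TM sentence, a clause-sized input scores low, while against the TM sentence's clauses it scores perfectly:

```python
# Word-level Levenshtein fuzzy matching: a sketch of the scoring scheme
# assumed in the discussion, not the algorithm of any particular TM tool.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over token lists.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_score(src, tm_seg):
    # Score = 1 - distance / length of the longer segment.
    a, b = src.split(), tm_seg.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Input clause (w1..w5) against a 15-word TM sentence: the shared prefix
# is only a third of the longer segment, so the whole-sentence score is low.
input_clause = "w1 w2 w3 w4 w5"
tm_sentence = "w1 w2 w3 w4 w5 w21 w22 w23 w24 w25 w26 w27 w28 w29 w30"
tm_clauses = ["w1 w2 w3 w4 w5", "w21 w22 w23 w24 w25 w26 w27 w28 w29 w30"]

whole = fuzzy_score(input_clause, tm_sentence)
split = max(fuzzy_score(input_clause, c) for c in tm_clauses)
```

With the sentence stored whole, the score is about 0.33 and falls below typical fuzzy thresholds; with the clauses stored separately, the first clause is an exact match.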
In this paper we examine the effect of performing clause splitting before retrieving matches from a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, namely that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
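As a rough intuition for what such pre-processing does, here is a deliberately naive clause splitter (entirely our illustration; the splitter actually used, from Puscasu (2004), detects finite verbs with linguistic rules rather than keyword patterns):

```python
# Naive illustrative clause splitter: break a sentence before "that"-clauses,
# absorbing a coordinating "and". A real splitter relies on POS tagging and
# finite-verb detection, not on keyword patterns like this one.
import re

def split_clauses(sentence):
    parts = re.split(r"\s+(?:and\s+)?(?=that\b)", sentence)
    return [p.strip() for p in parts if p.strip()]

sentence = ("the ministry of defense once indicated that about 20000 "
            "soldiers were missing in the korean war and that the ministry "
            "of defense believes there may still be some survivors")

# Each resulting clause can then be stored as a separate TM segment.
clauses = split_clauses(sentence)
```

On this sentence the toy rule happens to produce three clause-sized segments; the point is only that each clause becomes an independently retrievable TM unit.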
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments from a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
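The interaction between threshold and clause-sized matches can be sketched as follows (our illustration using Python's difflib similarity ratio; real TM tools use their own scoring, so exact percentages differ). A TM sentence that merely contains the input clause may clear a low threshold but miss a higher one:

```python
# Threshold-based fuzzy retrieval sketch: return the best TM segment whose
# similarity to the input meets the threshold, or None.
from difflib import SequenceMatcher

def score(src, seg):
    # Similarity ratio over word sequences, in [0, 1].
    return SequenceMatcher(None, src.split(), seg.split()).ratio()

def retrieve(src, tm, threshold):
    hits = [(score(src, seg), seg) for seg in tm]
    hits = [h for h in hits if h[0] >= threshold]
    return max(hits) if hits else None

tm = ["the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war and that the ministry of defense "
      "believes there may still be some survivors"]
src = ("the ministry of defense once indicated that about 20000 soldiers "
       "were missing in the korean war")

# The input is a 16-word prefix of a 29-word TM sentence:
# ratio = 2*16/(16+29) ~ 0.71, above a 70% threshold but below 75%.
```

So the same segment is retrieved at Trados's 70% default but not at Wordfast's 75% default, which is exactly why the minimum-threshold condition is worth reporting.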
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.
For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the
corresponding segments that should be
retrieved
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.
Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example.

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example.
5 Results
WORDFAST
Wo clause
splitting
W clause
splitting
Retrieved
(Default
threshold) 2333 9000
Retrieved
(Minimum
threshold) 3800 9222
TRADOS
Wo clause
splitting
W clause
splitting
Retrieved
(Default
threshold) 1400 8889
Retrieved
(Minimum
threshold) 1400 9667
Table 3 Percentage of correctly retrieved
segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all when it has more than one clause, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and count this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
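For readers who want to reproduce this style of analysis, the paired t statistic is simple to compute by hand (a minimal sketch with invented scores; the paper's figures come from SPSS on the actual per-segment match scores):

```python
# Paired t-test statistic: t = mean(d) / (stdev(d) / sqrt(n)), where d are
# per-segment differences between the two conditions.
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    d = [b - a for a, b in zip(scores_a, scores_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical per-segment match scores (0 = correct match not retrieved);
# these numbers are invented for illustration, not taken from the paper.
without_split = [0, 0, 75, 0, 80, 0, 0, 70, 0, 0]
with_split = [100, 85, 100, 90, 100, 80, 95, 100, 70, 88]

t = paired_t(without_split, with_split)
# The statistic is then compared against the t-distribution with n-1
# degrees of freedom to obtain the p-value.
```

A large positive t, as here, corresponds to the kind of highly significant improvement reported in Table 5.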
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold          19.23                     85.30888891            0.000
Minimum threshold          29.28                     86.48888891            0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold          11.78                     83.87777781            0.000
Minimum threshold          11.78                     88.07111113            0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           2.23                     17.21111111            0.000
Minimum threshold           6.49                     30.62111112            0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           2.71                     23.083                 0.000
Minimum threshold          16.83                     46.36777779            0.000

Table 5: Paired t-test on all experiments.
WORDFAST                         W/o clause splitting   With clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS                           W/o clause splitting   With clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching
performance in instances where the TM contains segments in which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to thank the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin/Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) in all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning
as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques have an effect only on the improvement of the raw MT output and not on the improvement of fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. In some cases they appear without a direct object (e.g. to make attention) and in others with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
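The gain can be sketched with a simple word-level Levenshtein score (our own approximation; commercial formulas differ, which is why the paper's exact 90% and 61% figures are not reproduced here):

```python
# Word-level fuzzy scores before and after paraphrasing the TM source unit:
# sentence (4) scores far higher against the paraphrased unit (2) than
# against the original SVC unit (1).

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance over word lists.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy(s1, s2):
    a, b = s1.split(), s2.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

svc_unit = "Press Cancel to make the cancellation of your personal information"
paraphrased_unit = "Press Cancel to cancel your personal information"
new_input = "Press Cancel to cancel your booking information"

before = fuzzy(new_input, svc_unit)         # low: the SVC inflates the distance
after = fuzzy(new_input, paraphrased_unit)  # high: only one word differs
```

Under this approximation the match jumps from 50% to about 86% once the SVC in the TM source unit has been paraphrased, mirroring the improvement described above.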
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The
module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three steps.

Figure 2: Pipeline of the paraphrase framework.

The first step includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb followed by a determiner, optionally an adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last step comprises the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
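To see why adding paraphrased units raises these scores, consider a minimal word-level fuzzy match sketch based on Levenshtein distance. This is an illustrative approximation, not the proprietary formula used by Trados Studio; the example segments are taken from Figure 4.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(seg, tm_seg):
    """Word-level fuzzy match score in percent (one common definition)."""
    a, b = seg.lower().split(), tm_seg.lower().split()
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b), 1))

new_seg = "Make sure that the brake hose is not twisted"
plain = "Ensure that the brake hose is not twisted"    # plTM unit
para = "Make sure that the brake hose is not twisted"  # paraphrased unit
print(fuzzy_match(new_seg, plain), fuzzy_match(new_seg, para))
```

Against the plain TM unit the score stays in the high-70s band, whereas the paraphrased unit yields a 100% match, which is exactly the shift from the lower to the higher categories seen in Table 1.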
To check the quality of the retrieved segments, a human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
While the fuzzy match improvements show some drops, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and it will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.
The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs and other source languages as well. Last but not least, applying a paraphrasing framework to the target sentence may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1, Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final¹. The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "být" from the source language is translated to the first word "be" in the translation, and the second word "větší" to the second word "greater". These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is de-noted as a special translation memory namedSUB The obtained subsegments are then filteredand used by the following methods for subsegmentcombination with regard to the segments from theinput document
bull JOIN new segments are built by concatenat-ing two segments from SUB output is J
1 JOINO joint subsegments overlap in asegment from the document output=OJ
2 JOINN joint subsegments neighbourin a segment from the document out-put=NJ
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

Table 1: An example of the SUBSTITUTEO method, Czech → English.
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
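A simplified version of this joining procedure can be sketched as follows, representing each subsegment by its (start, end) token indexes in the segment. The function name and the greedy merge loop are a simplification for illustration, not the authors' exact implementation:

```python
def join_subsegments(spans, mode="N"):
    """Greedily JOIN subsegment index spans (start, end) within one
    tokenized segment, longest first.  mode="O": spans must overlap
    (JOINO); mode="N": spans must be adjacent neighbours (JOINN)."""
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while work:
        cur = work.pop(0)          # start from the biggest subsegment
        joined = True
        while joined:
            joined = False
            for other in list(work):
                if mode == "O":
                    ok = cur[0] < other[1] and other[0] < cur[1]   # overlap
                else:
                    ok = cur[1] == other[0] or other[1] == cur[0]  # neighbours
                if ok:             # merge into a new, longer subsegment
                    cur = (min(cur[0], other[0]), max(cur[1], other[1]))
                    work.remove(other)
                    joined = True
        result.append(cur)
    return sorted(result)

# spans over a 6-token segment: (0,2) and (2,5) are neighbours, as are (2,5) and (5,6)
print(join_subsegments([(0, 2), (2, 5), (5, 6)]))
```

With neighbouring spans the whole segment gets covered by one joined subsegment; spans separated by an uncovered gap stay apart, which is where the SUBSTITUTE variants come into play.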
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
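The match categories used in this analysis can be reproduced with a simple bucketing function; this is a sketch of the standard CAT-tool banding, not MemoQ's internal code:

```python
from collections import Counter

def match_category(score):
    """Bucket a fuzzy match score (in percent) into the usual
    pre-translation analysis bands."""
    if score == 100:
        return "100%"
    if score >= 95:
        return "95-99%"
    if score >= 85:
        return "85-94%"
    if score >= 75:
        return "75-84%"
    if score >= 50:
        return "50-74%"
    return "No match"

def pretranslation_analysis(scores):
    """Count how many document segments fall into each band."""
    return Counter(match_category(s) for s in scores)

print(pretranslation_analysis([100, 96, 88, 80, 60, 30, 100]))
```

Expanding the TM shifts segments from the lower bands towards the upper ones, which is exactly what the "all" columns in Tables 2-4 quantify.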
Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 2: MemoQ analysis, TM coverage in %.
Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

Table 3: MemoQ analysis, DGT-TM.
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

Table 4: MemoQ analysis, TM + DGT-TM.
We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we count just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

³ http://kilgray.com/products/memoq

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 5: METEOR, 100% matches.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

Table 6: Error example, Czech → English.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³

¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
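A rough approximation of such a color coding scheme can be obtained with standard sequence matching: tokens of the retrieved TM segment that fall inside a matching block against the input sentence are marked as matches, the rest as likely edit sites. This sketch (with hypothetical MATCH/EDIT labels standing in for colors) is an illustration of the idea, not CATaLog's actual algorithm:

```python
from difflib import SequenceMatcher

def color_code(input_seg, tm_seg):
    """Label each token of a retrieved TM segment as MATCH or EDIT
    with respect to the input sentence."""
    a, b = input_seg.split(), tm_seg.split()
    labels = ["EDIT"] * len(b)
    for block in SequenceMatcher(a=a, b=b).get_matching_blocks():
        for k in range(block.b, block.b + block.size):
            labels[k] = "MATCH"                 # token shared with the input
    return list(zip(b, labels))

for token, label in color_code("Make sure that the brake hose is not twisted",
                               "Ensure that the brake hose is not bent"):
    print(f"{label:5s} {token}")
```

A real interface would render MATCH tokens in one color and EDIT tokens in another, directing the translator's attention to the parts that likely need post-editing.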
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog², which is language-pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
Even though this paper describes work inprogress our aim is to develop a tool thatis as intuitive as possible for end users andthis should have direct impact on transla-torsrsquo performance and productivity In therecent years several productive studies werealso carried out measuring different aspectsof the translation process such as cognitiveload effort time quality as well as other crite-ria (Bowker 2005 OrsquoBrien 2006 Guerberof2009 Plitt and Masselot 2010 Federico etal 2012 Guerberof 2012 Zampieri and Vela2014) User studies were taken into accountwhen developing CATaLog as our main moti-vation is to improve the translation workflowIn this paper however we do not yet explorethe impact of our tool in the translation pro-cess because the functionalities required forthis kind of study are currently under devel-opment in CATaLog Future work aims to in-vestigate the impact of the new features we areproposing on the translatorrsquos work
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
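This retrieval step can be sketched as follows. The snippet is a simplified, hedged stand-in (a word-level edit rate without TER's shift operation), not CATaLog's actual implementation; all function names are invented for illustration.

```python
# Simplified sketch of the ranking step (not the tool's actual code):
# score each TM source segment by a TER-like word edit rate against the
# input segment and keep the 5 most similar ones. Shifts are omitted.

def edit_rate(hyp, ref):
    """Word-level Levenshtein edits divided by reference length
    (a rough stand-in for TER)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / max(n, 1)

def top_matches(input_segment, tm_sources, k=5):
    # Lower edit rate means higher similarity, so sort ascending.
    scored = sorted(tm_sources,
                    key=lambda s: edit_rate(s.split(), input_segment.split()))
    return scored[:k]
```

On a toy TM, `top_matches("we would like a table by the window", tm_sources)` would return the near-identical "table" segment first.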
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

³ The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴ Work in progress.
3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

⁵ http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
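The alignment shown above can be reproduced by backtracing a word-level edit-distance table. The sketch below is our own illustration, not tercom itself (it omits TER's shift operation); it labels each TM word with C, S, D or I.

```python
# Minimal sketch (not tercom): recover a word-level edit alignment
# between a TM match and an input sentence by backtracing the
# Levenshtein table. C = match, S = substitution, D = deletion,
# I = insertion. Shifts are not handled.

def ter_like_alignment(tm_words, input_words):
    m, n = len(tm_words), len(input_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if tm_words[i - 1] == input_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and \
           tm_words[i - 1] == input_words[j - 1]:
            ops.append("C"); i -= 1; j -= 1      # words match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append("S"); i -= 1; j -= 1      # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D"); i -= 1              # TM word deleted
        else:
            ops.append("I"); j -= 1              # input word inserted
    return ops[::-1]

tm = "we want to have a table near the window .".split()
inp = "we would like a table by the window .".split()
print(ter_like_alignment(tm, inp))
# → ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C', 'C']
```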
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other edit operations. Different costs for different edit operations should therefore yield better results. These edit costs, or weights, can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
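The weighting idea can be made concrete with a small sketch. The weights below are illustrative only (the exact values used in CATaLog are not stated here), and the function is our own stand-in, not the tool's code.

```python
# Illustrative sketch of edit-operation weighting: a word-level edit
# distance where deletions are far cheaper than insertions and
# substitutions. Weights are invented for illustration.

def weighted_edit_cost(tm_words, input_words,
                       w_del=0.1, w_ins=1.0, w_sub=1.0):
    m, n = len(tm_words), len(input_words)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del        # dropping TM words
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins        # adding input words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_words[i - 1] == input_words[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

# Equal weights: TM segment 2 looks closer (3 insertions vs 6 deletions)
assert weighted_edit_cost(tm1, test, w_del=1.0) == 6.0
assert weighted_edit_cost(tm2, test, w_del=1.0) == 3.0
# Cheap deletions: TM segment 1 wins (6 * 0.1 < 3 * 1.0)
assert weighted_edit_cost(tm1, test) < weighted_edit_cost(tm2, test)
```

With equal weights the shorter TM segment 2 is preferred, while a low deletion weight flips the preference to TM segment 1, exactly as argued above.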
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
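The projection step can be sketched as follows. This is our own illustration, not the tool's code: it takes the TM-source positions that TER marked as matches (C) and the source-target alignment of the English-Bengali example above (converted to 0-based indices), and labels each target word green or red.

```python
# Hedged sketch of the color-coding step: combine (a) the TER alignment
# between input and TM source and (b) the word alignment between TM
# source and target to decide which target words to paint green.

def color_target(matched_src_positions, src2tgt, tgt_len):
    """matched_src_positions: 0-based TM-source indices marked C by TER.
    src2tgt: dict mapping a source index to a set of 0-based target
    indices. Returns one 'green'/'red' label per target word."""
    green = set()
    for s in matched_src_positions:
        green.update(src2tgt.get(s, set()))
    return ["green" if t in green else "red" for t in range(tgt_len)]

# TM source: we want to have a table near the window .
# C positions against the input "we would like a table by the window .":
matched = {0, 4, 5, 7, 8, 9}          # we, a, table, the, window, .
# The example alignment above, 0-based ('to', 'have', 'the' unaligned):
src2tgt = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}, 9: {6}}
print(color_target(matched, src2tgt, tgt_len=7))
# → ['green', 'green', 'red', 'green', 'green', 'red', 'green']
```

The target words aligned to "want" and "near" (the D and S positions) come out red, i.e., they are the fragments the post-editor must edit.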
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a screenshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change . i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3. আপিন ভল (Gloss: apni vul)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that we only consider a match if a word appears in exactly the same form in an input sentence. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
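The posting-list filter can be sketched as follows. This is a toy illustration with an invented stop-word list, not the actual CATaLog index.

```python
# Sketch of the posting-list filter: build an inverted index over TM
# source segments, then restrict similarity scoring to candidates that
# share at least one non-stopword vocabulary word with the input.

from collections import defaultdict

STOP = {"the", "a", "to", "of", "is", "in", "and"}   # toy stop list

def tokenize(sentence):
    return [w for w in sentence.lower().split() if w not in STOP]

def build_index(tm_sources):
    index = defaultdict(set)
    for sid, src in enumerate(tm_sources):
        for word in tokenize(src):
            index[word].add(sid)
    return index

def candidates(input_segment, index):
    # Union of the posting lists of the input's vocabulary words.
    ids = set()
    for word in tokenize(input_segment):
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "i think you 've got the wrong number"]
index = build_index(tm)
print(candidates("we would like a table by the window", index))  # → {0}
```

Only segment 0 shares vocabulary words with the query, so the expensive TER alignment would be computed for one candidate instead of three.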
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents the ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. Ph.D. thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. Ph.D. thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
Organizers
Constantin Orasan University of Wolverhampton UK
Rohit Gupta University of Wolverhampton UK
Program Committee
Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk
Invited Speaker
Marcello Federico Fondazione Bruno Kessler Italy
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36
Conference Program
9:00-9:10 Welcome
Constantin Orasan

9:10-10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15-11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30-12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15-13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30-15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

15:00-15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30-16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15-17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1-8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3º I
28230 Las Rozas, Madrid, Spain
carlaparra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the consistency of the terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio¹ and Muses in memoQ²), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

¹ www.sdl.com
² www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1-4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge of and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what, according to them, is the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all of them had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganized (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ⁴, it is not always carried out automatically, and it is difficult to estimate
³ In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
⁴ memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if translators understand other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.⁵

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of such functionality: fragment assembly.⁶
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
⁵ For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs with different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
⁶ For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TMsegments and internally computes their alignmentprobabilities It then inserts the translations intothe source segment and suggests this new some-times partially translated sentence as a match Al-ternatively only the fragments translated will beinserted in the target segment one after anotherwithout the source sentence words that could notbe retrieved
Figure 1: memoQ's fragment assembly functionality.
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM:
(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación
Now imagine we need to translate the sentence in 3:
(3) EN: The following message then appears: The application will close
Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:
(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment was translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea that originated from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1-4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.
(7) Installing new services
In these cases we proceeded as follows:
1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
4.2 Parentheses, square brackets and curly braces
Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without the parenthetical content. When sentences like the ones in Examples 8-10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters as well as the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso
The strategy to create new segments was the following:
1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) the sentence with those characters and the content between them removed;

(b) a fragment starting at the opening character and finishing at the closing one, including the content within; in this fragment the parentheses, square brackets or curly braces are maintained;

(c) a fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
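The three fragments (a)-(c) can be sketched as follows. This is an illustrative sketch under assumed regex patterns, not the authors' exact script; in the real extraction the check would be applied to both the source and the target segment:

```python
import re

# Match the first parenthesised, square-bracketed or curly-braced span.
BRACKETS = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]|\{([^}]*)\}")

def bracket_fragments(sentence):
    """Return the three fragments described in 4.2, or None if no brackets."""
    match = BRACKETS.search(sentence)
    if match is None:
        return None
    inner = next(g for g in match.groups() if g is not None)
    return {
        # (a) the sentence with the characters and their content removed
        "without": (sentence[:match.start()] + sentence[match.end():]).strip(),
        # (b) the fragment from opening to closing character, brackets kept
        "with_brackets": match.group(0),
        # (c) the content alone, brackets stripped
        "content": inner,
    }

frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)")
```

Each of the three fragments would then be stored as a new TM segment together with its counterpart from the target side.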
At this preliminary stage we assumed that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11-12 illustrate this kind of segment:
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
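The procedure just described can be sketched as below. The regexes for quotations and for the <1>-style paired tags of Example 12 are illustrative assumptions, not the authors' exact patterns:

```python
import re

# Quoted text and memoQ-style paired numbered tags (<1>text<1>).
QUOTED = re.compile(r'"([^"]+)"')
TAGGED = re.compile(r"<(\d+)>(.*?)<\1>")

def strip_marks(sentence):
    """Return the sentence without quotes/tags, plus the marked text as new segments."""
    new_segments = []
    # Keep the text inside quotation marks or tags as segments of their own.
    new_segments.extend(QUOTED.findall(sentence))
    new_segments.extend(text for _, text in TAGGED.findall(sentence))
    # Keep the same sentence with the quotes/tags removed as a new segment.
    cleaned = QUOTED.sub(lambda m: m.group(1), sentence)
    cleaned = TAGGED.sub(lambda m: m.group(2), cleaned)
    return cleaned, new_segments

cleaned, extra = strip_marks(
    "If you clear the <1>Inherit settings from parent policy<1> check box")
```

The same function, applied to the target side, yields the aligned counterparts for each extracted segment.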
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words  Segments
Repetitions    1064        80
100%              0         0
95-99%            4         2
85-94%            0         0
75-84%          285        14
50-74%         2523       187
No Match       2404       142
Total          6280       425
Table 1: Project statistics according to memoQ, using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client and was thus the biggest one. Table 2 summarises the size of the three TMs.
            Segments  Words EN  Words ES    Total
Project TM     16842    212472    244159   456631
Product TM     20923    274542    317797   592339
Client TM     256099   3427861   3951732  7379593
Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

                         Segments  Words EN  Words ES    Total
Project TM new segments      6776     56973     66297   123270
Product TM new segments      7760     71034     83125   154159
Client TM new segments      74041    662714    769705  1432419

Table 3: Size of the new TMs generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
                 Proj. TM  Prod. TM  Client TM
Ellipsis                7         6         50
Colon                1801      2094      17894
Parentheses          1361      1621      29637
Square brackets        78        73        598
Curly braces            0         0          0
Quotations           1085      1146       8797
Tags                 2523      2892      17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination ofalready existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory - perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval, and translation memory cleaning. While for improving the accuracy of retrieved segments there is a fair amount of work (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed), or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in Equation 1:
CG = (ls − ld) / √(3.4 (ls + ld))    (1)
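Equation 1 translates directly into code. The sketch below assumes the segment lengths ls and ld are already computed; the paper does not state whether lengths are counted in characters or words:

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length difference (Tiedemann, 2011), Equation 1."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Segments of equal length score 0; a large length imbalance between source
# and destination gives a score far from 0, flagging a possible false translation.
```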
In Figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In Figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in Figure 3. In Figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in Figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: it contains 1243 bi-segments and has 373 negative examples.

• Test Set: it contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
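The biased sampling described above can be sketched as follows. This is a minimal illustration: candidates are drawn with probability proportional to how far their Church-Gale score lies from the centre of the distribution; the exact weighting function is an assumption, as the paper does not give the bias used:

```python
import random

def biased_sample(bisegments, scores, k, seed=0):
    """Draw k bi-segments, favouring those with extreme Church-Gale scores."""
    # Small floor keeps every item drawable while heavily favouring the tails.
    weights = [abs(s) + 1e-6 for s in scores]
    rng = random.Random(seed)
    return rng.choices(bisegments, weights=weights, k=k)

# Segments whose scores are far from 0 dominate the candidate negatives.
candidates = [("hello", "ciao"), ("some text", "random unrelated spam")]
negatives = biased_sample(candidates, scores=[0.2, 45.0], k=10)
```

The drawn candidates would then be validated manually, as described above, before entering the training and test sets.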
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in Equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segment contains a URL address; otherwise it is 0.
• has tag: The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.
• has email: The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.
• has number: The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.
• has capital letters: The value of the feature is 1 if the source or target segment contains words with at least one capital letter; otherwise it is 0.
• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and target segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and target segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).
• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and target segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL-address vectors of the source and target segments. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and target segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the target segment and the machine translation of the source segment into the target language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of such words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of such words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
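A few of the simpler features above can be sketched directly; the cosine-similarity features all share the pattern shown for punctuation. This is a minimal illustration, not the paper's implementation: the punctuation set is an assumption, and features such as bisegment similarity and lang dif would additionally need the external MT and language-detection services the paper describes.

```python
import math
from collections import Counter

PUNCT = set(".,;:!?()[]\"'")  # assumed punctuation inventory

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def features(src: str, tgt: str) -> dict:
    fs = {}
    fs["same"] = 1 if src == tgt else 0
    fs["has_number"] = 1 if any(ch.isdigit() for ch in src + tgt) else 0
    fs["has_capital_letters"] = 1 if any(ch.isupper() for ch in src + tgt) else 0
    fs["punctuation_similarity"] = cosine(
        Counter(ch for ch in src if ch in PUNCT),
        Counter(ch for ch in tgt if ch in PUNCT))
    return fs

f = features("Hello, world!", "Bonjour, monde!")
```

Identical punctuation in the two segments yields a similarity of 1, matching the intuition stated for true translations.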
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu⁵.
4.2 Algorithms

As we showed in Section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.
• Gaussian Naive Bayes: If the conditional independence assumption that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those k neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
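Since the paper uses scikit-learn, the battery of classifiers above can be instantiated as follows. The data here is a synthetic stand-in for the real feature vectors of Section 4.1 (the actual training set has 1243 examples), and training accuracy is reported only to show the API pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))                      # toy feature vectors in [0, 1]
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # linearly separable toy labels

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    scores[name] = clf.score(X, y)  # training accuracy, for illustration only
```

Note how the fully grown decision tree fits the training data perfectly, illustrating the overfitting tendency mentioned above.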
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71
Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
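The evaluation protocol can be sketched with scikit-learn's own utilities: `StratifiedKFold` for the three-fold stratified split, and `DummyClassifier` playing the role of the two baselines (its `uniform` and `stratified` strategies match the baseline definitions above). Data is synthetic with roughly the paper's 70/30 class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.random((300, 4))
y = (X[:, 0] > 0.3).astype(int)  # roughly a 70/30 positive/negative split

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
systems = {
    "SVM": LinearSVC(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}
# Mean F1 over the three stratified folds, per system.
mean_f1 = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
           for name, clf in systems.items()}
```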
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.
The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best-performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best-performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59
Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion matrix for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods, where precision is increased even if recall drops. We make some suggestions in the next section.
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best-performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In future work we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on the subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that these probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
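The precision-over-recall idea can be sketched by thresholding the probabilities a logistic regression assigns: keeping only confidently positive bi-segments shrinks the set of predicted positives, trading recall for precision. Data and the 0.9 threshold below are illustrative assumptions, not the paper's values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((600, 4))
# Noisy labels so the classifier cannot be perfect near the boundary.
y = (X[:, 0] + 0.1 * rng.standard_normal(600) > 0.35).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

default = (proba >= 0.5).astype(int)   # the usual decision rule
strict = (proba >= 0.9).astype(int)    # keep only confident positives

p_default = precision_score(y_te, default)
r_default = recall_score(y_te, default)
p_strict = precision_score(y_te, strict, zero_division=0)
r_strict = recall_score(y_te, strict)
```

Raising the threshold can only remove predicted positives, so recall decreases monotonically, while precision tends to rise on the retained, high-confidence cases.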
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM, but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of a match being found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In Section 2 we discuss related work on TM matching. In Section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in Section 4 and discuss the results in Section 5. Finally, we present our conclusion and future work in Section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
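The (a)/(b) example above can be made concrete. The sketch below uses Python's `difflib` ratio as a stand-in for the Levenshtein-based scores real TM tools compute: matching the whole of (a) against (b) yields a low score, while matching the shared clause against its counterpart yields a perfect one.

```python
from difflib import SequenceMatcher

def match_score(x: str, y: str) -> float:
    """Fuzzy match score in percent (stand-in for a TM Levenshtein score)."""
    return 100 * SequenceMatcher(None, x, y).ratio()

a = " ".join(f"w{i}" for i in range(1, 21))          # input sentence (a)
b = " ".join([f"w{i}" for i in range(1, 6)] +
             [f"w{i}" for i in range(21, 36)])       # TM sentence (b)
clause = " ".join(f"w{i}" for i in range(1, 6))      # the shared clause
tm_clause = " ".join(f"w{i}" for i in range(1, 6))   # same clause, stored split

whole_score = match_score(a, b)            # low: only a sub-segment matches
clause_score = match_score(clause, tm_clause)  # 100: exact clause match
```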
In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
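Threshold-based retrieval as described above can be sketched as follows. `difflib`'s character ratio again stands in for the (mainly Levenshtein-based) scores of Wordfast and Trados, and the two thresholds mirror the default-like and minimum-like settings; the tiny TM is invented for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    # Character-based stand-in for a word/character Levenshtein score.
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def retrieve(query: str, tm: list, threshold: float) -> list:
    """Return (score, segment) pairs at or above the threshold, best first."""
    hits = [(fuzzy_score(query, seg), seg) for seg in tm]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)

tm = ["the ministry of defense once indicated",
      "there may still be some survivors",
      "completely unrelated material about weather"]
query = "the ministry of defence once indicated"

default_hits = retrieve(query, tm, threshold=75)  # default-like threshold
minimum_hits = retrieve(query, tm, threshold=30)  # minimum-like threshold
```

Lowering the threshold can only add matches, which is exactly why the minimum-threshold condition gives a more favourable setting for retrieving partially matching segments.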
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM, the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Table 1: Set A Example

Without clause splitting:

Input:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 2: Set B Example

Without clause splitting:

Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM:
  a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday
5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33                 90.00
Retrieved (minimum threshold)          38.00                 92.22

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00                 88.89
Retrieved (minimum threshold)          14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold            19.23                     85.31             0.000
Minimum threshold            29.28                     86.49             0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold            11.78                     83.88             0.000
Minimum threshold            11.78                     88.07             0.000

SET B

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold             2.23                     17.21             0.000
Minimum threshold             6.49                     30.62             0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold             2.71                     23.08             0.000
Minimum threshold            16.83                     46.37             0.000

Table 5: Paired t-test (mean match scores) on all experiments
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67                 17.84
Retrieved (minimum threshold)          10.00                 41.08

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33                 25.95
Retrieved (minimum threshold)          36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar in meaning but not in form. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on sentences whose lexical items are close to those of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent.
The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, the fuzzy match level for similar but not identical sentences is sometimes too low, which confuses the translator.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis: their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
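As an illustration, a minimal word-level sketch of an edit-distance-based fuzzy match score in the style of Koehn's formula might look as follows (the exact algorithms used by commercial tools are not disclosed, so this is only one plausible normalization):

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(segment, tm_segment):
    # word-based fuzzy match: 1 - distance / length of the longer segment
    s, t = segment.split(), tm_segment.split()
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
# sentence (1) vs (2): 1 substitution and 3 deletions over 10 tokens -> 0.6
```

Note that the resulting percentage depends on tokenization and normalization choices, which is one reason published match figures (such as the 70% above) vary between tools.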
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its full-verb counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptions, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
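For illustration, such entries can be read into a simple structure. This sketch assumes the comma-separated NooJ .dic line format (lemma, then category, then "+"-separated properties, e.g. artist,N+FLX=TABLE+Hum); the real dictionaries carry more information than shown here:

```python
def parse_entry(line):
    # split "lemma,CATEGORY+PROP1=VALUE+PROP2" into its parts
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    category = fields[0]
    props = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        # flag-style properties (e.g. +Hum) are stored as True
        props[key] = value if value else True
    return {"lemma": lemma, "category": category, "properties": props}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```

Here FLX names the inflectional paradigm (TABLE) and Hum is a semantic flag, mirroring the property types listed in the paragraph above.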
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
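The effect of such a grammar can be imitated, very roughly, in code. The noun-to-verb mapping and the support-verb list below are illustrative assumptions only, and unlike the NooJ grammar this toy handles neither tense and person agreement nor morphological analysis:

```python
import re

# illustrative nominalization-to-verb mapping; a real system derives this
# from the dictionary's derivational paradigms
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"

SVC_PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS
    + r"\s+(?:(?:a|an|the)\s+)?("   # optional determiner
    + "|".join(NOMINALIZATIONS)     # predicate noun
    + r")(?:\s+of)?",               # optional preposition
    re.IGNORECASE)

def paraphrase_svcs(sentence):
    # replace "support verb (+ det) + predicate noun (+ of)" by the full verb
    return SVC_PATTERN.sub(lambda m: NOMINALIZATIONS[m.group(1).lower()],
                           sentence)

print(paraphrase_svcs(
    "Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

This reproduces, for the simple present only, the rewriting of sentence (1) into sentence (2) discussed in Section 3.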
The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
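Put together, the three pipelines amount to the following scheme. This is a hypothetical sketch: tokenize, paraphrase and detokenize stand in for the Berkeley Tokenizer and NooJ components, and only the source side is ever modified:

```python
def expand_tm(tm, tokenize, paraphrase, detokenize):
    """tm is a list of (source, target) units; returns the original units
    plus a paraphrased copy of every unit whose source contained an SVC."""
    expanded = list(tm)                    # target side stays "as is"
    for source, target in tm:
        tokens = tokenize(source)          # pipeline 1: extract + tokenize
        new_source = detokenize(paraphrase(tokens))  # pipeline 2: paraphrase
        if new_source != source:           # pipeline 3: concatenate new units
            expanded.append((new_source, target))
    return expanded
```

With the example TM unit ("Press Cancel to make the cancellation of your personal information", "Premere Cancel per cancellare i propri dati personali"), the expanded TM would additionally hold the pair ("Press Cancel to cancel your personal information", same Italian target).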
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on a degree of fuzzy match that at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
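For instance, the 17% improvement quoted for the 100% match category corresponds to the change in segment counts in Table 1 relative to the 200-segment test set (how the figures for the remaining categories are aggregated is not fully specified in the text):

```python
# segment counts per fuzzy match category, taken from Table 1
plTM  = {"100%": 14, "95%-99%": 23, "85%-94%": 18,
         "75%-84%": 51, "50%-74%": 32, "No Match": 62}
parTM = {"100%": 48, "95%-99%": 32, "85%-94%": 29,
         "75%-84%": 38, "50%-74%": 18, "No Match": 35}
TOTAL = 200  # segments in the test set

# percentage-point change per category relative to the whole test set
gains = {cat: 100.0 * (parTM[cat] - plTM[cat]) / TOTAL for cat in plTM}
print(round(gains["100%"]))  # 17
```

Note that 95%-99% yields 4.5 and 85%-94% yields 5.5 under this computation, matching the reported 5% and 6% after rounding.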
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.
Despite the improvements in fuzzy matching, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that, for all fuzzy-match ranges, our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future, we will continue to explore paraphrasing of other support verbs, as well as support for other languages. Last but not least, applying a paraphrase framework to the target sentence may further improve the quality of the translations.
References
Athayde, M. F. (2001). Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. (2008). Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. (2008). Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. (2003). The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. (2014). Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. (2014). Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. (2010a). Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. (2010b). Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. (1997). Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. (1980). The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. (2010). Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. (1999). Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. (2003). NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. (2009). EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. (2014). Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. (2010). Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ        EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step the proposed TM expansion meth-ods process the available translation memory andgenerate all consistent phrases as subsegmentsfrom it Subsegments and the corresponding trans-lations are generated using the Moses (Koehn etal 2007) tool directly from the TM no additionaldata is used
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translation in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
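To make the selection step concrete, the following sketch picks the best translation for a subsegment from Moses-style phrase-table entries. Combining the four scores by a simple product is an assumption for illustration, not necessarily the exact criterion used in the system.

```python
# Hypothetical sketch: select the best translation for a subsegment from
# Moses-style entries (inverse phrase prob., inverse lexical weighting,
# direct phrase prob., direct lexical weighting). The product combination
# is an assumption, not the authors' documented criterion.

def best_translation(candidates):
    """candidates: list of (translation, (p_inv, lex_inv, p_dir, lex_dir))."""
    def score(entry):
        _, probs = entry
        p = 1.0
        for x in probs:
            p *= x
        return p
    return max(candidates, key=score)[0]

candidates = [
    ("be greater", (0.538, 0.053, 0.538, 0.136)),  # values from Figure 1
    ("be larger",  (0.170, 0.054, 0.019, 0.148)),
]
print(best_translation(candidates))  # -> be greater
```

On the Figure 1 entries, "be greater" wins under any reasonable monotone combination of the four scores, since it dominates on three of the four.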
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
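These three situations can be read directly off the alignment points. The sketch below assumes the Moses-style string representation ("0-0 1-1") shown in Figure 1:

```python
# Sketch: classify the alignment points of a subsegment pair, detecting the
# three cases mentioned in the text: unaligned words (empty alignment),
# one-to-many links, and reordering (opposite orientation).

def analyse_alignment(points_str, src_len, tgt_len):
    links = [tuple(map(int, p.split("-"))) for p in points_str.split()]
    src_covered = {s for s, _ in links}
    tgt_covered = {t for _, t in links}
    empty = (len(src_covered) < src_len) or (len(tgt_covered) < tgt_len)
    one_to_many = any(sum(1 for s, _ in links if s == src) > 1
                      for src in src_covered)
    # reordering: some pair of alignment links crosses
    reordered = any(s1 < s2 and t1 > t2
                    for s1, t1 in links for s2, t2 in links)
    return empty, one_to_many, reordered

# "být větší" -> "be greater": monotone and fully aligned
print(analyse_alignment("0-0 1-1", 2, 2))  # -> (False, False, False)
```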
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
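The overlap/neighbour distinction underlying the O and N variants of both methods can be illustrated with a small helper that classifies the relation between two subsegment occurrences. Representing occurrences as half-open token spans is our assumption:

```python
# Sketch of the overlap/neighbour/gap distinction behind the O and N
# variants. Subsegment occurrences are represented as half-open token
# spans (start, end) within the input segment; this representation is
# an assumption for illustration.

def combination_type(span_a, span_b):
    a0, a1 = span_a
    b0, b1 = span_b
    if a0 > b0:  # normalise order so span_a starts first
        a0, a1, b0, b1 = b0, b1, a0, a1
    if b0 < a1 <= b1:
        return "overlap"      # JOINO / SUBSTITUTEO candidates
    if b0 == a1:
        return "neighbour"    # JOINN candidates
    if b0 > a1:
        return "gap"          # SUBSTITUTEN fills the gap between spans
    return "contained"        # one span lies inside the other

print(combination_type((0, 3), (2, 5)))  # -> overlap
print(combination_type((0, 3), (3, 5)))  # -> neighbour
print(combination_type((0, 2), (4, 6)))  # -> gap
```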
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
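As an illustration of the fluency ranking, the toy sketch below averages smoothed n-gram log-probabilities for n = 1..5 over a tiny corpus. The authors use a KenLM model trained on enTenTen, so the model, the smoothing and the combination formula here are all stand-in assumptions:

```python
import math
from collections import Counter

# Toy stand-in for the target-language model used to rank combinations:
# an average of additively smoothed n-gram log-probabilities for n = 1..5.
# The tiny corpus, the smoothing constant and the averaging scheme are
# illustrative assumptions; the paper uses a KenLM model.

corpus = ["can be divided into the following categories".split(),
          "the data area may have an arbitrary shape".split()]

counts = Counter()
for sent in corpus:
    for n in range(1, 6):
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1

VOCAB = len({w for s in corpus for w in s})
TOKENS = sum(len(s) for s in corpus)

def fluency(tokens, alpha=0.1):
    """Average smoothed log-probability over n-gram orders 1..5."""
    total, terms = 0.0, 0
    for n in range(1, 6):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            hist = ngram[:-1]
            num = counts[ngram] + alpha
            den = (counts[hist] if hist else TOKENS) + alpha * VOCAB
            total += math.log(num / den)
            terms += 1
    return total / max(terms, 1)

good = "can be divided into the following categories".split()
bad = "categories following the into divided be can".split()
print(fluency(good) > fluency(bad))  # -> True
```

A fluent combination shares many high-order n-grams with the corpus, so it scores higher than a scrambled one of the same length.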
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
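Our reading of this greedy loop can be sketched as follows. The span representation and the exact way joined results are fed back to the front of the work list are simplifying assumptions:

```python
# Assumed sketch of the greedy JOIN loop: process subsegment occurrences
# longest-first, join the current span with any adjacent or overlapping
# span, and prepend newly built (longer) spans to the work list so that
# longer subsegments are always preferred.

def greedy_join(spans):
    # spans: (start, end) token spans of subsegments found in the segment
    work = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(work)
    results = []
    while work:
        cur = work.pop(0)
        results.append(cur)
        new = []
        for other in work:
            a, b = sorted([cur, other])
            if b[0] <= a[1]:                     # adjacent or overlapping
                merged = (a[0], max(a[1], b[1]))
                if merged not in seen:           # avoid reprocessing
                    seen.add(merged)
                    new.append(merged)
        # newly built, longer subsegments are tried first next iteration
        work = sorted(new, key=lambda s: s[1] - s[0], reverse=True) + work
    return max(results, key=lambda s: s[1] - s[0])

# three chained subsegments join into one full-segment span
print(greedy_join([(0, 3), (3, 5), (5, 8)]))  # -> (0, 8)
```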
2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-sized in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM (coverage in %)

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM (coverage in %)

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM (coverage in %)

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
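The edit-rate idea can be sketched with a standard dynamic program. Note that real TER (as computed by tercom) additionally allows block shifts; this simplified version counts only insertions, deletions and substitutions:

```python
# Hedged sketch of the edit-rate idea behind TER. Real TER (tercom) also
# permits block shifts at cost 1; this simplified version counts only
# insertions, deletions and substitutions, normalised by the reference
# (test sentence) length.

def edit_rate(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n] / max(n, 1)

tm  = "we want to have a table near the window .".split()
inp = "we would like a table by the window .".split()
print(edit_rate(tm, inp))  # 4 edits (1 D + 3 S) / 9 reference tokens
```

On the running example from the paper, this yields 1 deletion and 3 substitutions, i.e. an edit rate of 4/9, matching the TER alignment shown below.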
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we want  to   have a table near the window .
|  D     S    S    |   |   S    |   |      |
we -  would like   a table  by  the window .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results; these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which makes the TM prefer TM segment 1 over TM segment 2 for the test segment shown above.
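The argument above can be sketched numerically. The concrete weights (0.1 for deletion, full cost for the other operations) are illustrative assumptions:

```python
# Sketch of the weighted edit cost discussed above: deletions are given a
# very low cost, while insertions, substitutions and shifts keep full
# cost. The concrete weights are assumptions for illustration.

WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(edit_ops):
    """edit_ops: string of TER edit symbols, e.g. 'CCCCCDDDDDD'."""
    return sum(WEIGHTS[op] for op in edit_ops)

# TM segment 1 matches all 5 test words, then needs 6 deletions
# ("to the holiday inn by taxi")
ops_tm1 = "C" * 5 + "D" * 6
# TM segment 2 matches 2 words, then needs 3 insertions ("does it cost")
ops_tm2 = "C" * 2 + "I" * 3

print(weighted_cost(ops_tm1))  # 0.6 -> TM segment 1 is preferred
print(weighted_cost(ops_tm2))  # 3.0
```

Under equal weights the costs would be 6 vs. 3 and TM segment 2 would win; the low deletion weight reverses the ranking, as argued above.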
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between the selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
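The projection of TER match information onto the target side can be sketched as follows. The 0-indexed link list below is derived from the GIZA++ example above, and the exact projection rule (a target word is green iff it is linked to a TER-matched source word) is our assumption:

```python
# Assumed sketch of the color-coding step: TER tells which TM source
# tokens match the input segment; the GIZA++ word alignment then projects
# this match/mismatch status onto the TM target tokens.

def color_target(ter_ops, src_tgt_links, tgt_len):
    """ter_ops: per-TM-source-token edit symbols ('C' = match);
    src_tgt_links: list of 0-indexed (src_idx, tgt_idx) alignment pairs."""
    matched_src = {i for i, op in enumerate(ter_ops) if op == "C"}
    colors = ["red"] * tgt_len
    for s, t in src_tgt_links:
        if s in matched_src:
            colors[t] = "green"
    return colors

# TER ops for "we want to have a table near the window ." vs. the input
# "we would like a table by the window ." (see the TER alignment above)
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# GIZA++ links from the example, converted from 1-indexed to 0-indexed
links = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]
print(color_target(ter_ops, links, 7))
# -> ['green', 'green', 'red', 'green', 'green', 'red', 'green']
```

Target words linked only to deleted or substituted source words (here the translations of "want" and "near") stay red, flagging them for the post-editor.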
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of green to red. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number
3 you are wrong
4 you pay me
5. you 're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare the results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
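The posting-list filter can be sketched with a small inverted index; the stop-word list here is an assumption:

```python
from collections import defaultdict

# Sketch of the posting-list filter: an inverted index from lowercased,
# stop-word-filtered surface forms to the TM sentences containing them.
# Only sentences sharing at least one vocabulary word with the input are
# then scored with TER. The stop-word list is an illustrative assumption.

STOP = {"the", "a", "to", "is", "of"}

def build_index(tm_sentences):
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for tok in sent.lower().split():
            if tok not in STOP:
                index[tok].add(sid)
    return index

def candidates(index, query):
    ids = set()
    for tok in query.lower().split():
        if tok not in STOP:
            ids |= index.get(tok, set())
    return ids

tm = ["how much does it cost",
      "you gave me the wrong change",
      "we want to have a table near the window"]
idx = build_index(tm)
print(sorted(candidates(idx, "how much is the cost")))  # -> [0]
```

Only sentence 0 shares a vocabulary word with the query, so TER needs to be computed for one candidate instead of three.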
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file with which the post-editors can directly work later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on, and development of, a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different fea-tures to measure time and cognitive load in thepost-editing process The popular keystrokelogging is among them We would like to inves-tigate the impact of the innovations presentedhere in real world experimental settings withhuman translators
We are integrating and refining a coupleof features in the tool as for example sen-tence and clause segmentation using commaand semi-colon as good indicators It shouldalso be noted that in this paper we consideredonly word alignments In the future we wouldalso like to explore how multi-word expressions(MWE) and named entities (NE) can help inTM retrieval and post editing
We are also exploring contextual syntacticand semantic features which can be includedin similarity scores calculation to retrieve moreappropriate translations Another improve-ment we are currently working on concernsweight assignment to different edit operations
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
Table of Contents
Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36
Conference Program
9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3ª I, 28230 Las Rozas, Madrid, Spain
carla.parra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the use of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it is the client who imposes the use of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as predictive text (Autosuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.
1 www.sdl.com
2 www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise, and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite their reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new TM features that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has
a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium, and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what, according to them, was the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically, and it is difficult to estimate
3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's such functionality: fragment assembly.6

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

6 For further information, see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM.

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.
Now imagine we need to translate the sentence in Example 3:

(3) EN: The following message then appears: The application will close.

Taking the part of Example 1 before the colon and the part of Example 2 after the colon, we would be able to produce the right translation, as shown in Example 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving these translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parenthesis
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be one in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in Examples 5 and 6 could appear.

(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
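The three steps above can be sketched as follows. This is a hypothetical helper of our own, not the paper's actual script (which is not published), and it assumes the separator occurs once and in the same role on both sides:

```python
def split_on_separator(source, target, separators=("...", ":")):
    """Split an aligned (source, target) TM segment at a separator that
    both sides share, yielding two new aligned TM segments."""
    for sep in separators:
        if sep in source and sep in target:
            src_before, _, src_after = source.partition(sep)
            tgt_before, _, tgt_after = target.partition(sep)
            return [(src_before.strip(), tgt_before.strip()),
                    (src_after.strip(), tgt_after.strip())]
    return []  # no shared separator: no new segments generated

pairs = split_on_separator(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
# pairs[0] aligns "The following message then appears"
# with "Aparece el siguiente mensaje"
```

Requiring the separator on both sides is what makes the alignment of the two halves plausible without any further analysis.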
4.2 Parenthesis, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets, or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parenthesis. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters as well as the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets, or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) the sentence with those characters and the content between them removed;

(b) a fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment the parenthesis, square brackets, or curly braces are maintained;

(c) a fragment containing only the content within those characters (parenthesis, square brackets, or curly braces), but without them.

3. Create a new TM segment for each fragment.
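Steps 2(a)-(c) can be sketched for one side of the segment pair as follows; this is our own illustrative reimplementation, not the paper's script, and the function name is ours:

```python
import re

# Opening/closing pairs handled: parentheses, square brackets, curly braces.
PAIRS = {"(": ")", "[": "]", "{": "}"}

def bracket_fragments(sentence):
    """For each bracket type present, keep (a) the sentence without the
    brackets and their content, (b) each bracketed clause including the
    brackets, and (c) the bracketed content alone."""
    fragments = []
    for open_b, close_b in PAIRS.items():
        pattern = re.compile(
            re.escape(open_b) + "([^" + re.escape(open_b) +
            re.escape(close_b) + "]*)" + re.escape(close_b))
        contents = pattern.findall(sentence)
        if contents:
            stripped = pattern.sub("", sentence)          # (a)
            fragments.append(" ".join(stripped.split()))
            for content in contents:
                fragments.append(open_b + content + close_b)  # (b)
                fragments.append(content)                     # (c)
    return fragments

frags = bracket_fragments(
    "Creates an installation package (if it was not created earlier)")
```

Running the same extraction on the target side and pairing fragments positionally reflects the paper's preliminary assumption that bracketed clauses keep their order across the language pair.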
At this preliminary stage, we considered that when a sentence has several clauses in parenthesis, square brackets, or curly braces, these appear in the same order in the target language. This was done because for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish) this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags, removed the quotations and/or tags, and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segment.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
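The handling just described can be sketched as follows, assuming quotes are plain double quotation marks and tags come in paired <n>…</n> form (both assumptions ours; the paper does not publish its script):

```python
import re

def quote_and_tag_segments(sentence):
    """Keep (a) the sentence with quotation marks and paired tags removed,
    plus (b) every quoted or tagged span as a new segment of its own."""
    spans = re.findall(r'"([^"]+)"', sentence)                       # quoted text
    spans += [m.group(2)                                             # tagged text
              for m in re.finditer(r"<(\w+)>(.*?)</\1>", sentence)]
    stripped = re.sub(r"</?\w+>", "", sentence).replace('"', "")
    return [" ".join(stripped.split())] + spans

segs = quote_and_tag_segments(
    "If you clear the <1>Inherit settings from parent policy</1> check box "
    "in the <2>Settings inheritance</2> section, the lock is lifted")
```

Unlike the bracket case, the surrounding sentence is kept with the quoted/tagged text left in place (only the marks themselves are stripped), since removing the text would leave an ungrammatical segment.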
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients: a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95–99%             4          2
85–94%             0          0
75–84%           285         14
50–74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ using the project TM
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
             Segments   Words (EN)   Words (ES)     Total
Project TM      16842       212472       244159    456631
Product TM      20923       274542       317797    592339
Client TM      256099      3427861      3951732   7379593

Table 2: Size of the different TMs used
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM
                           Segments   Words (EN)   Words (ES)     Total
Project TM new segments        6776        56973        66297    123270
Product TM new segments        7760        71034        83125    154159
Client TM new segments        74041       662714       769705   1432419

Table 3: Size of the new TMs generated using our approach
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
                  Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parenthesis             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4, and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4, and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6, and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-
             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started
           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95–99%       2    3    3    4    4    4    9    9    9     9     9
85–94%       0    1    1    1    1    1    2    2    2     2     2
75–84%      14    9   14    9   15   15   12   16   16    16    16
50–74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band
erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 187 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
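The naive, language-independent segment generation described above can be sketched roughly as follows (an illustrative reconstruction, not the authors' actual script: the function name is ours, tag-based splitting is omitted, and fragments are paired purely by position):

```python
import re

def extract_fragments(source, target):
    """Split a bi-segment into aligned fragment candidates, using
    punctuation marks as naive boundaries (no linguistic information)."""
    # Split on strong punctuation; keep only non-empty chunks.
    src_parts = [p.strip() for p in re.split(r"[.;:!?]", source) if p.strip()]
    tgt_parts = [p.strip() for p in re.split(r"[.;:!?]", target) if p.strip()]
    # Only pair fragments positionally when both sides split the same way.
    if len(src_parts) == len(tgt_parts) and len(src_parts) > 1:
        return list(zip(src_parts, tgt_parts))
    return []

pairs = extract_fragments(
    "Open the file. Save your changes.",
    "Abra el archivo. Guarde sus cambios.")
print(pairs)
# [('Open the file', 'Abra el archivo'), ('Save your changes', 'Guarde sus cambios')]
```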
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination ofalready existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory - perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms in finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained with the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated only partially into Italian: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat arise in the collaborative strategy of enriching MyMemory. Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / √(3.4 (ls + ld))    (1)
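Equation 1 is straightforward to compute; a minimal sketch (the function name is ours):

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    near 0 for length-balanced bi-segments, far from 0 otherwise."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A length-balanced bi-segment scores near zero ...
print(round(church_gale(40, 42), 2))   # -0.12
# ... while a grossly unbalanced one lands far out in a tail.
print(round(church_gale(200, 10), 2))  # 7.11
```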
In figures 2 and 3 we plot the distribution of the relative frequency of Church Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right but close to a normal plot. In figure 2 we plot the Church Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool available at https://www.matecat.com
[Figure: histogram; x-axis: Church Gale Score, y-axis: Probability]

Figure 2: The distribution of Church Gale Scores in the Matecat Set
[Figure: histogram; x-axis: Church Gale Score, y-axis: Probability]

Figure 3: The distribution of Church Gale Scores in the Collaborative Set
[Figure: histogram with normal curve; x-axis: Church Gale Score, y-axis: Probability]

Figure 4: The normal curve added to the distribution of Church Gale Scores in the Collaborative Set
Church Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: It contains 1243 bi-segments and has 373 negative examples.
• Test Set: It contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
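The tail-biased draw from the Collaborative Set can be sketched as follows (illustrative only: the 80/20 tail/center split and the tail quantile are our assumptions, not values from the paper):

```python
import random

def sample_biased(bisegments, scores, n, tail_quantile=0.9):
    """Draw n bi-segments, favouring those whose |CG| score lies in the
    distribution's tails, where misaligned pairs are more likely."""
    ranked = sorted(zip(bisegments, scores),
                    key=lambda x: abs(x[1]), reverse=True)
    cutoff = max(1, int(len(ranked) * (1 - tail_quantile)))
    tail = [b for b, _ in ranked[:cutoff]]   # most extreme scores
    rest = [b for b, _ in ranked[cutoff:]]
    # Mostly tail segments, with some from the centre for contrast.
    k_tail = min(len(tail), int(n * 0.8))
    return random.sample(tail, k_tail) + random.sample(rest, n - k_tail)
```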
[Figure: Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Figure 5: The Q-Q plot for the Matecat set
[Figure: Q-Q plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Figure 6: The Q-Q plot for the Collaborative set
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segments contain words that have at least a capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference of the number of words containing at least a capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
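A couple of the features above can be sketched in a few lines (the function names are ours; real tokenisation and punctuation handling would be more careful):

```python
import math
import string
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(src, tgt):
    """Cosine similarity between punctuation count vectors."""
    pv = lambda s: Counter(c for c in s if c in string.punctuation)
    return cosine(pv(src), pv(tgt))

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language codes."""
    return sum(1 for a, b in zip(declared, detected) if a != b)

print(round(punctuation_similarity("Hello, world!", "Ciao, mondo!"), 2))  # 1.0
print(lang_dif(("en", "it"), ("en", "fr")))  # 1
```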
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are among the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than Decision Trees.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
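The six classifiers above could be compared in scikit-learn roughly as follows (a sketch, not the authors' exact setup: the feature matrix here is synthetic, mimicking only the 1243-example training set size and its roughly 30% negative class proportion):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix (~30% negatives).
X, y = make_classification(n_samples=1243, n_features=16,
                           weights=[0.3, 0.7], random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=3)  # three-fold stratified evaluation
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```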
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
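The precision, recall and F1 implied by these confusion counts can be checked with a few lines of arithmetic (the figures come out close to the SVM row of Table 2, up to rounding):

```python
# Reported SVM confusion counts on the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.81 recall=0.85 f1=0.83

# Share of all test examples that are false negatives (good segments
# that would be wrongly deleted).
fn_rate = fn / (tp + fp + fn + tn)
print(f"{fn_rate:.0%} of all examples are false negatives")  # 10%
```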
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
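Thresholding on the predicted probability could look like the following sketch (our illustration, not the paper's implementation: it assumes the negative class is labelled 0 and that a fitted scikit-learn classifier exposing predict_proba is available):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def confident_negatives(clf, X_new, threshold=0.8):
    """Indices of bi-segments predicted to be false translations (class 0)
    with probability >= threshold; only these would be deleted."""
    neg_col = list(clf.classes_).index(0)        # column of the negative class
    proba = clf.predict_proba(X_new)[:, neg_col]
    return [i for i, p in enumerate(proba) if p >= threshold]

# Toy illustration with a clearly separable one-feature problem.
X = np.array([[0.0]] * 50 + [[1.0]] * 50)
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)
print(confident_negatives(clf, np.array([[0.0], [1.0]])))  # [0]
```

Raising the threshold trades recall for precision: fewer bi-segments are deleted, but those that are deleted are more likely to be genuinely false translations.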
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, such matches are less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, allowing sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, the ultimate goal of which is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
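To make the effect concrete, the following word-level sketch (our own illustration in Python; commercial TM tools use their own proprietary variants of Levenshtein matching) scores sentence (b) against input (a) before and after splitting, under the assumption that w1-w5 form a complete clause in both sentences:

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance over token lists
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def fuzzy_score(a, b):
    # one common normalisation: 1 - distance / length of the longer segment
    return 1 - edit_distance(a, b) / max(len(a), len(b))

a = [f"w{i}" for i in range(1, 21)]                                   # input (a): w1 ... w20
b = [f"w{i}" for i in range(1, 6)] + [f"w{i}" for i in range(21, 36)]  # TM sentence (b)

whole = fuzzy_score(a, b)           # 0.25: below any usable fuzzy threshold
clause = fuzzy_score(a[:5], b[:5])  # 1.0: an exact match once both sides are split
```

The whole-sentence score of 25% falls below even the minimum thresholds discussed later, while the clause-level comparison yields an exact match.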
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
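Puscasu's splitter rests on full morphosyntactic analysis, which we do not reproduce here. As a rough sketch of the underlying idea (a clause boundary is posited only where a finite verb occurs on each side), the toy heuristic below uses a hand-listed verb lexicon and marker set of our own invention; a real system would use a POS tagger and the full rule set:

```python
# Toy finite-verb lexicon and clause-boundary markers: our own stand-ins,
# not Puscasu's (2004) actual resources.
FINITE_VERBS = {"indicated", "were", "believes", "is", "was", "told"}
MARKERS = {"that", "and", "but", "because"}

def split_clauses(sentence):
    tokens = sentence.split()
    clauses, current, seen_verb = [], [], False
    for i, tok in enumerate(tokens):
        # split at a marker only if a finite verb occurs on both sides of it
        if tok in MARKERS and seen_verb and any(t in FINITE_VERBS for t in tokens[i + 1:]):
            clauses.append(" ".join(current))
            current, seen_verb = [tok], False
        else:
            current.append(tok)
            seen_verb = seen_verb or tok in FINITE_VERBS
    if current:
        clauses.append(" ".join(current))
    return clauses
```

On the first Table 1 sentence, this heuristic happens to reproduce the two-clause split shown there; its coverage is of course far below the 81.39 F-measure reported for the real splitter.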
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm of most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause). The TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is shown in Table 2.
Without clause splitting:
  Input:          the ministry of defense once indicated that about 20000
                  soldiers were missing in the korean war
  Segment in TM:  the ministry of defense once indicated that about 20000
                  soldiers were missing in the korean war and that the
                  ministry of defense believes there may still be some
                  survivors

With clause splitting:
  Input:          - the ministry of defense once indicated
                  - that about 20000 soldiers were missing in the korean war
  Segments in TM: - the ministry of defense once indicated
                  - that about 20000 soldiers were missing in the korean war
                  - and that the ministry of defense believes
                  - there may still be some survivors

Table 1: Set A example.

Without clause splitting:
  Input:          a member of the chart-topping collective so solid crew
                  dumped a loaded pistol in an alleyway
  Segment in TM:  a member of the rap group so solid crew threw away a
                  loaded gun during a police chase southwark crown court
                  was told yesterday

With clause splitting:
  Input:          a member of the chart-topping collective so solid crew
                  dumped a loaded pistol in an alleyway
  Segments in TM: - a member of the rap group so solid crew threw away a
                  loaded gun during a police chase
                  - southwark crown court was told yesterday

Table 2: Set B example.
5 Results
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, we observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
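The paired design can be sketched in a few lines of Python. The scores below are invented placeholders rather than the study's data, and in practice the p-value would be read from SPSS (or scipy.stats.ttest_rel) rather than from the bare t statistic computed here:

```python
from math import sqrt
from statistics import mean, stdev

def case_score(clause_scores):
    # a multi-clause input contributes the average of its clause-level match
    # scores, so each original segment remains a single paired case
    return mean(clause_scores)

def paired_t(before, after):
    # t statistic of a paired t-test; significance is then looked up on the
    # t distribution with len(diffs) - 1 degrees of freedom
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# invented example: four paired cases, unretrieved matches scored as 0
without_split = [case_score(s) for s in [[0], [40], [0], [35]]]
with_split = [case_score(s) for s in [[90, 80], [95], [88], [100, 70]]]
t = paired_t(without_split, with_split)
```

Averaging clause scores before pairing is what keeps the two conditions comparable case by case, even though clause splitting changes the number of segments.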
SET A
  WORDFAST            Without clause splitting   With clause splitting   p-value
  Default threshold          19.23                  85.30888891           0.000
  Minimum threshold          29.28                  86.48888891           0.000
  TRADOS
  Default threshold          11.78                  83.87777781           0.000
  Minimum threshold          11.78                  88.07111113           0.000

SET B
  WORDFAST
  Default threshold           2.23                  17.21111111           0.000
  Minimum threshold           6.49                  30.62111112           0.000
  TRADOS
  Default threshold           2.71                  23.083                0.000
  Minimum threshold          16.83                  46.36777779           0.000

Table 5: Paired t-test on all experiments (mean match scores).
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarland University, Saarbrücken. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitl@auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalising the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on the improvement of the raw MT output, not on the improvement of fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
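As an illustration of this kind of edit-distance scoring, the sketch below (our own; the exact counts and normalisation in the paper follow Koehn's (2010) formula and a different tokenisation, so the percentages differ slightly) computes the word-level distance between sentences (1) and (2):

```python
def levenshtein(a, b):
    # classic DP edit distance; works on strings (characters) or token lists (words)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

s1 = "Press Cancel to make the cancellation of your personal information".split()
s2 = "Press Cancel to cancel your personal information".split()

dist = levenshtein(s1, s2)               # 4 word-level edits (1 substitution, 3 deletions)
score = 1 - dist / max(len(s1), len(s2)) # normalised fuzzy score
```

With this naive normalisation, the pair scores around 60%, which is in the same low range as the figures quoted above: too low for many translators to trust the match.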
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, the main meaning residing with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except in some cases, they appear with a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
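A drastically simplified stand-in for the NooJ machinery, using a hand-written lookup table of our own invention rather than dictionaries plus local grammars, conveys the core idea of rewriting the source side only:

```python
# Toy SVC paraphrase table; the real system derives such rewrites from NooJ
# dictionaries (nominalization paradigms) and local grammars.
SVC_PARAPHRASES = {
    "make the cancellation of": "cancel",
    "make a reservation for": "reserve",
}

def paraphrase_source(segment):
    for svc, full_verb in SVC_PARAPHRASES.items():
        segment = segment.replace(svc, full_verb)
    return segment

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
# only the source side is rewritten; the Italian target stays untouched
paraphrased_tm = [(paraphrase_source(src), tgt) for src, tgt in tm]
```

After this rewrite, the paraphrased source unit is identical to sentence (2), so an input like sentence (1) scores as an exact match while its human-translated target is served unchanged.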
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
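Entries of this form pair a lemma with a category and a list of `+`-separated properties. A small sketch of how such lines could be read programmatically (the parser is our illustration, not part of NooJ):

```python
def parse_entry(line):
    """Parse a NooJ-style dictionary line such as 'artist,N+FLX=TABLE+Hum'."""
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    entry = {"lemma": lemma, "category": fields[0], "features": {}}
    for f in fields[1:]:
        if "=" in f:                   # valued property, e.g. FLX=TABLE
            key, value = f.split("=", 1)
            entry["features"][key] = value
        else:                          # boolean property, e.g. Hum
            entry["features"][f] = True
    return entry

e = parse_entry("artist,N+FLX=TABLE+Hum")
print(e["lemma"], e["category"], e["features"])
# -> artist N {'FLX': 'TABLE', 'Hum': True}
```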
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework.
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
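This first stage can be sketched as follows, under the assumption that the memory is stored in TMX, the common TM exchange format (the storage format is not specified above; the element names follow the TMX convention). Only the source-language units are collected; target units are never modified:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the full XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_source_units(tmx_text, src_lang="en"):
    """Collect source-language segments from a TMX document.

    Target <tuv> elements are never touched, so the target side
    of the memory stays exactly as the translator approved it.
    """
    root = ET.fromstring(tmx_text)
    sources = []
    for tu in root.iter("tu"):
        for tuv in tu.iter("tuv"):
            if tuv.get(XML_LANG, "").lower().startswith(src_lang):
                sources.append(tuv.findtext("seg"))
    return sources

tmx = """<tmx version="1.4"><body>
<tu><tuv xml:lang="en"><seg>Make a decision</seg></tuv>
    <tuv xml:lang="it"><seg>Prendere una decisione</seg></tuv></tu>
</body></tmx>"""
print(extract_source_units(tmx))
# -> ['Make a decision']
```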
Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables, namely the verb and the predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
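For a single tense, the effect of such a local grammar can be imitated with a small table of support-verb constructions and their derived verbs. The mapping and the regular expressions below are purely our own toy illustration; NooJ works with compiled grammar graphs and full dictionaries rather than hand-written patterns:

```python
import re

# Toy mapping from (support verb, predicate noun[, preposition])
# to the derived full verb; NooJ obtains these from dictionary entries.
SVC = {
    ("make", "decision"): "decide",
    ("make", "cancellation", "of"): "cancel",
    ("give", "answer", "to"): "answer",
}

def paraphrase(sentence):
    for pattern, verb in SVC.items():
        sv, noun = pattern[0], pattern[1]
        prep = pattern[2] if len(pattern) > 2 else ""
        # support verb + optional determiner/adjective + noun (+ preposition)
        rx = rf"\b{sv}\s+(?:\w+\s+)?{noun}\b" + (rf"\s+{prep}\b" if prep else "")
        sentence = re.sub(rx, verb, sentence)
    return sentence

print(paraphrase("they make a decision quickly"))
# -> they decide quickly
print(paraphrase("you must make the cancellation of the order"))
# -> you must cancel the order
```

A real grammar additionally carries the tense and person constraints described above, so that the generated verb inflects like the original support verb.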
The last pipeline performs the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc., as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before when the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation method for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%). However, we achieve considerably high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in the 95–99% category, 6% in the 85–94% category, 28% in the 75–84% category and, finally, 27% in the 0–74% category (No match + 50–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for the experimental data.
To check the quality of the retrieved segments, a human evaluation was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is two-fold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments for which the fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain “as is” encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
While our system yields clear fuzzy match improvements, it also presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:       Make sure that the brake hose is not twisted
TM sl:     Ensure that the brake hose is not twisted
TM tg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM sl:  Make sure that the brake hose is not twisted

Seg:       CAUTION You must make the istallation of the version 6 of the software
TM sl:     CAUTION You must install the version 6 of the software
TM tg:     ATTENZIONE Si deve installare la versione 6 del software
parTM sl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible in cases where the meaning is the same but the sentences do not share the same lexical items.
The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs and will add support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size, real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable into savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases or, in the best case, full segments.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN^O: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN^N: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE^O method, Czech → English.
original:        “lze rozdělit do těchto kategorií” (can be divided into these categories)
subsegment:      “následujících kategorií” (the following categories)
new subsegment:  “lze rozdělit do následujících kategorií”
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE^O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTE^N: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n from 1 to 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
² In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
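The joining loop described above can be sketched as follows. This is a simplified illustration: subsegments are reduced to (start, end) token spans, joining means span adjacency (as in JOIN^N), and the language-model fluency filtering is omitted:

```python
def join_subsegments(spans):
    """Greedily join adjacent subsegment spans, longest first.

    spans: list of (start, end) token positions within one segment.
    Returns the spans produced by successful joins.
    """
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    while I:
        seg = I.pop(0)
        T = []
        for other in I:
            if seg[1] == other[0]:       # other starts where seg ends
                T.append((seg[0], other[1]))
            elif other[1] == seg[0]:     # other ends where seg starts
                T.append((other[0], seg[1]))
        if T:
            T.sort(key=lambda s: s[1] - s[0], reverse=True)
            joined.extend(T)
            I = T + I                    # prepend new subsegments to I
    return joined

# Spans covering tokens 0-3, 3-5 and 5-6 of a 6-token segment:
result = join_subsegments([(0, 3), (3, 5), (5, 6)])
print((0, 6) in result)  # the full segment is eventually assembled
# -> True
```

Because the list is ordered by span length, the longest candidate combinations are produced first, which mirrors the preference for joining longer subsegments described above.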
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The SUB column displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, “all”, is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.
Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95–99%    0.84   0.91   0.64   0.90   1.37
85–94%    0.07   0.05   0.25   0.76   0.81
75–84%    0.80   0.91   1.71   3.78   4.40
50–74%    8.16   10.05  25.09  40.95  42.58
any       10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM.
Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95–99%    0.75   0.44   0.49   0.66
85–94%    0.05   0.08   0.49   0.61
75–84%    0.46   0.96   3.67   3.85
50–74%    10.24  27.77  41.90  44.47
all       11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.
Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95–99%    0.98   0.59   1.24   1.45
85–94%    0.09   0.22   1.34   1.37
75–84%    1.03   2.26   6.35   7.07
50–74%    12.15  34.84  49.82  51.62
all       14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of the translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, compared to a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a “mixed” translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment “as is”, i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just the fully (100%) matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness; it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.
Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38
DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.
source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filters for discarding phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³
Jadavpur University, India¹
Saarland University, Germany²
German Research Center for Artificial Intelligence (DFKI), Germany³
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 IntroductionThe use of translation and text processing soft-ware is an important part of the translationworkflow Terminology and computer-aidedtranslation tools (CAT) are among the mostwidely used software that professional trans-lators use on a regular basis to increase theirproductivity and also improve consistency intranslation
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with the retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in
1www.matecat.com
2The tool will be released as a freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT–TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
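As a sketch of this retrieval step, the ranking can be implemented with any edit-rate scorer. The snippet below uses a plain word-level Levenshtein rate as a stand-in for the tercom scorer, and a tiny invented TM; it is illustrative, not CATaLog's actual implementation.

```python
# Stand-in for TER: word-level Levenshtein distance divided by the length
# of the input (reference) segment. No shift operation is modelled here.
def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / max(len(r), 1)

def top_matches(input_seg, tm, k=5):
    """Return the k TM (source, target) pairs with the lowest edit rate."""
    return sorted(tm, key=lambda pair: edit_rate(pair[0], input_seg))[:k]

# Invented three-entry TM (targets elided)
tm = [("how much does it cost to the holiday inn by taxi", "..."),
      ("you are wrong", "..."),
      ("we want to have a table near the window", "...")]
best = top_matches("we would like a table by the window", tm, k=2)
print(best[0][0])  # -> we want to have a table near the window
```

A real system would plug the tercom alignments in place of `edit_rate` to obtain both the score and the match/edit flags used later for color coding.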
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insertion, deletion, substitution and shift. We use the TER metric (using tercom-0.7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-0.7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we want  to   have  a table near the window
|  D     S    S     |   |    S    |    |
we -   would like   a table  by  the window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
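The weighted scoring idea can be illustrated with a small dynamic-programming sketch. The weights below are illustrative assumptions, not the system's actual values; with a cheap deletion cost the long TM segment 1 outscores the short TM segment 2.

```python
# Weighted word-level edit distance: deletions from the TM segment are much
# cheaper than insertions or substitutions, mirroring post-editing effort.
# Weight values (0.1 / 1.0 / 1.0) are illustrative assumptions.
def weighted_edits(tm_src, test, w_del=0.1, w_ins=1.0, w_sub=1.0):
    a, b = tm_src.split(), test.split()
    # d[i][j]: cost of turning the first i TM words into the first j test words
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * w_del
    for j in range(1, len(b) + 1):
        d[0][j] = j * w_ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,   # delete a TM word
                          d[i][j - 1] + w_ins,   # insert a test word
                          d[i - 1][j - 1] + sub) # match or substitute
    return d[len(a)][len(b)]

test = "how much does it cost"
seg1 = "how much does it cost to the holiday inn by taxi"
seg2 = "how much"
# seg1 needs 6 cheap deletions (0.6); seg2 needs 3 costly insertions (3.0)
print(round(weighted_edits(seg1, test), 2), weighted_edits(seg2, test))
```

With equal weights the ranking would flip, which is exactly the behavior the example above argues against.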
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
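A GIZA++-style alignment line like the example above can be parsed into a source-position to target-positions mapping with a few lines of code. This is a sketch under the assumption that the format is exactly as shown (each source token followed by its aligned target indices in "({ ... })"); it is not CATaLog's actual parser.

```python
import re

# Parse one GIZA++-style alignment line into
# {source position: (token, [aligned 1-based target positions])}.
def parse_giza(line):
    pairs = re.findall(r'(\S+)\s+\(\{([\d\s]*)\}\)', line)
    return {i: (tok, [int(x) for x in idx.split()])
            for i, (tok, idx) in enumerate(pairs)}

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) "
        "a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
align = parse_giza(line)
print(align[1])  # -> ('we', [1]): 'we' aligns to target position 1
```

Position 0 is the special NULL token, which collects target words left unaligned by GIZA++.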
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
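The two-step coloring can be sketched as follows: TER flags mark which TM source words match the input, and the word alignment projects those flags onto the target words. The data structures here are assumptions for illustration (using the "we want to have a table near the window" example), not CATaLog's internal format.

```python
# Project TER match flags from the TM source words onto the TM target words
# via the word alignment, producing a per-target-word color.
def color_target(ter_flags, src_to_tgt, tgt_len):
    """ter_flags[i] is True if TM source word i matched the input (TER 'C');
    src_to_tgt maps source positions to aligned 1-based target positions."""
    colors = ["red"] * tgt_len           # default: needs post-editing
    for i, matched in enumerate(ter_flags):
        if matched:
            for j in src_to_tgt.get(i, []):
                colors[j - 1] = "green"  # aligned target word can be kept
    return colors

# TER flags for "we want to have a table near the window ." vs the input:
# C D S S C C S C C C
ter_flags = [True, False, False, False, True, True, False, True, True, True]
# Word alignment (source position -> target positions), per the GIZA++ example
src_to_tgt = {0: [1], 1: [6], 4: [4], 5: [5], 6: [3], 8: [2], 9: [7]}
print(color_target(ter_flags, src_to_tgt, 7))
```

Target words aligned only to edited source words (here, the translation of "want") stay red, which is what signals the fragments to post-edit.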
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars.

2. i think you've got the wrong number.

3. you are wrong.

4. you pay me.
5. you're overcharging me.
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without posting lists: in an ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horak, Ales, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Ales Horak and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3.º I, 28230 Las Rozas, Madrid, Spain
carla.parra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance coherence in terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (AutoSuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.
1www.sdl.com
2www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to carry out the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has
a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate
3In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
4memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource_3.html
when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
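The scheduling idea can be sketched as a counter that triggers reorganisation once the threshold of newly added segments is reached. The class, the storage step and the `reorganize()` hook below are hypothetical placeholders, not any CAT tool's actual API.

```python
# Sketch: auto-trigger TM reorganisation every `threshold` added segments.
class AutoReorgTM:
    def __init__(self, threshold=500):
        self.threshold = threshold
        self.added_since_reorg = 0
        self.reorg_count = 0

    def add_segment(self, source, target):
        # ... store the segment pair in the TM database (omitted) ...
        self.added_since_reorg += 1
        if self.added_since_reorg >= self.threshold:
            self.reorganize()

    def reorganize(self):
        # ... re-index and clean up the TM (omitted) ...
        self.reorg_count += 1
        self.added_since_reorg = 0

tm = AutoReorgTM(threshold=500)
for i in range(1200):
    tm.add_segment(f"src {i}", f"tgt {i}")
print(tm.reorg_count)  # -> 2: reorganised at segments 500 and 1000
```

Tying the trigger to the number of additions rather than to wall-clock time keeps the maintenance cost proportional to how heavily the TM is actually used.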
Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same occurs with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's such functionality: fragment assembly.6
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6 For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality.
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
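The left-to-right insertion behaviour just described can be illustrated with a toy assembler. This is purely illustrative: memoQ's actual matching is statistical and considerably more sophisticated, and the fragment entries below are hypothetical.

```python
# Toy left-to-right fragment assembly: substitute stored fragment
# translations into the source, leaving unmatched words untouched.
# Purely illustrative; memoQ's real algorithm is more elaborate.
def assemble(source, fragment_tm):
    suggestion = source
    for src_frag, tgt_frag in fragment_tm.items():
        suggestion = suggestion.replace(src_frag, tgt_frag)
    return suggestion

# Hypothetical fragment TM entries (EN > ES)
fragment_tm = {
    "The following message then appears": "Aparece el siguiente mensaje",
    "The application will close": "Se cerrará la aplicación",
}
suggestion = assemble(
    "The following message then appears: The application will close", fragment_tm)
```

Note how such an assembler necessarily keeps the source order of the fragments, which is exactly the reordering limitation discussed above.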
Examples 1 and 2 show two similar sentences in our TM.

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in 3.

(3) EN: The following message then appears: The application will close

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4.

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originated from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1-4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear.
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.
(7) Installing new services
In these cases, we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
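The three steps above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation; the delimiter set and the requirement that the delimiter occur exactly once are assumptions.

```python
# Sketch of steps 1-3 for ellipsis/colon segments (illustrative only;
# the actual tool and its delimiter handling may differ).
DELIMITERS = ("...", ":")

def split_on_delimiter(source, target):
    """Return new (source, target) TM segment pairs, or [] if no split applies."""
    for delim in DELIMITERS:
        # Step 1: the delimiter must appear (once) in both source and target.
        if source.count(delim) == 1 and target.count(delim) == 1:
            src_head, src_tail = source.split(delim)
            tgt_head, tgt_tail = target.split(delim)
            # Steps 2-3: one new TM segment per non-empty fragment pair.
            candidates = [(src_head.strip(), tgt_head.strip()),
                          (src_tail.strip(), tgt_tail.strip())]
            return [(s, t) for s, t in candidates if s and t]
    return []

pairs = split_on_delimiter(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
```

Applied to Example 1, this yields two new bi-segments, one per side of the colon.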
4.2 Parentheses, square brackets and curly braces
Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8-10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
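The three-fragment strategy can be sketched as below. This is illustrative only: it handles one bracketed span per bracket type and would be applied to the source and target segments independently before pairing the results, details the paper does not specify.

```python
import re

# Sketch of the three-fragment strategy (steps 2a-2c) for bracketed content.
# Illustrative only; the real tool's handling of nesting and multiple
# occurrences may differ.
BRACKET_PAIRS = [(r"\(", r"\)"), (r"\[", r"\]"), (r"\{", r"\}")]

def bracket_fragments(segment):
    """Return (sentence_without, fragment_with_brackets, inner_content) triples."""
    triples = []
    for open_b, close_b in BRACKET_PAIRS:
        pattern = re.compile(open_b + r"([^()\[\]{}]*)" + close_b)
        match = pattern.search(segment)
        if match:
            triples.append((
                pattern.sub("", segment).strip(),  # (a) sentence minus bracketed part
                match.group(0),                    # (b) fragment including the brackets
                match.group(1),                    # (c) content without the brackets
            ))
    return triples

frags = bracket_fragments(
    "Creates an installation package for application installation (if it was not created earlier)")
```

On Example 8, this produces the bare sentence, the parenthesised clause with its parentheses, and the clause content alone.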
At this preliminary stage, we assumed that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11-12 illustrate this kind of segments.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
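The handling of quotations and double tags can be sketched as follows. The `<n>...<n>` tag pattern mirrors Example 12 but is an assumption; real TM tags may be encoded differently.

```python
import re

# Sketch of the quotation/tag handling: keep (a) the sentence with the
# quotes/tags removed and (b) the quoted or tagged text as new segments.
TAG = re.compile(r"<(\d+)>(.*?)<\1>")   # paired tags like <1>...<1> (assumption)
QUOTE = re.compile(r'"([^"]+)"')

def quote_tag_fragments(segment):
    """Return (segment_without_markers, list_of_inner_texts)."""
    inner = [m.group(2) for m in TAG.finditer(segment)]
    inner += [m.group(1) for m in QUOTE.finditer(segment)]
    # Same sentence, quotation marks and tags stripped but their text kept.
    stripped = QUOTE.sub(r"\1", TAG.sub(r"\2", segment))
    return stripped, inner

stripped, inner = quote_tag_fragments(
    "If you clear the <1>Inherit settings<1> check box in the <2>Settings inheritance<2> section")
```

Each element of `inner` would become a new TM segment, alongside the `stripped` sentence.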
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients: a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing, because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95-99%             4          2
85-94%             0          0
75-84%           285         14
50-74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
             Segments   Words (EN)   Words (ES)   Words (total)
Project TM      16842       212472       244159          456631
Product TM      20923       274542       317797          592339
Client TM      256099      3427861      3951732         7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM
                          Segments   Words (EN)   Words (ES)   Words (total)
Project TM new segments       6776        56973        66297          123270
Product TM new segments       7760        71034        83125          154159
Client TM new segments       74041       662714       769705         1432419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
Type              Proj. TM   Prod. TM   Client TM
Ellipsis                 7          6          50
Colon                 1801       2094       17894
Parentheses           1361       1621       29637
Square brackets         78         73         598
Curly braces             0          0           0
Quotations            1085       1146        8797
Tags                  2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-
             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started  234   204   150   202   143   143    38    27    26     26     26
Pre-trans    155   111   178   108   179   183   139   199   205    201    205
Fragments     36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.
           TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%         0     5     5     5     5     5     5     5     5      5      5
95-99%       2     3     3     4     4     4     9     9     9      9      9
85-94%       0     1     1     1     1     1     2     2     2      2      2
75-84%      14     9    14     9    15    15    12    16    16     16     16
50-74%     187   130   198   136   202   202   180   223   226    226    226
No match   142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
ReferencesKevin Flanagan 2015 Subsegment recall in Transla-
tion Memory ndash perceptions expectations and realityThe Journal of Specialised Translation (23)64ndash88January
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, Sept 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While for improving the accuracy of retrieved segments there is a fair amount of work (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories in MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s - l_d) / sqrt(3.4 (l_s + l_d))        (1)
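Equation 1 can be computed directly. A minimal sketch follows; segment lengths are taken in characters here, which is an assumption, as the paper does not state the unit.

```python
import math

# Modified Church-Gale length-difference score (equation 1).
# Lengths are measured in characters here; this unit is an assumption.
def church_gale(source, target):
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Equal-length segments score 0; a longer source gives a positive score,
# a longer target a negative one.
score = church_gale("How are you?", "Come stai?")
```

For a true translation pair the score stays near zero, while partial translations or random text tend to push it into the tails of the distribution.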
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
[Histogram omitted: x axis "Church Gale Score", y axis "Probability Church Gale Score"]
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
[Histogram omitted: x axis "Church Gale Score", y axis "Probability Church Gale Score"]
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
[Histogram omitted: x axis "Church Gale Score", y axis "Probability Church Gale Score"]
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [-131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots, in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling to the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
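One plausible way to realise this tail-biased sampling is to weight each bi-segment by the absolute value of its Church-Gale score, so that pairs far from the center of the distribution are drawn more often. The exact bias the authors used is not specified, so the weighting below is an assumption.

```python
import random

# Sample bi-segments, preferring those far from the center (score near 0)
# of the Church-Gale distribution. The |score| weighting is one plausible
# choice; the paper does not specify the exact bias used.
def tail_biased_sample(bisegments, scores, k, seed=0):
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in scores]  # epsilon keeps zero scores drawable
    return rng.choices(bisegments, weights=weights, k=k)

# Toy data: a plausible pair (low |score|) and a suspicious one (high |score|).
sample = tail_biased_sample(
    [("Hello", "Ciao"), ("some long random spam text", "ok")], [0.05, 42.0], k=10)
```

Segments sampled this way would then be manually validated, as described above, before entering the training and test sets.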
- Training Set: It contains 1243 bi-segments and has 373 negative examples.

- Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat set
Figure 6: The Q-Q plot for the Collaborative set
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test set are the following:
- same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied in the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
- cg_score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
- has_url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.
- has_tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.
- has_email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.
- has_number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
- has_capital_letters: The value of the feature is 1 if the source or target segments contain words that have at least a capital letter, otherwise it is 0.
- has_words_capital_letters: The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
- punctuation_similarity: The value of this feature is the cosine similarity between the source and destination segments' punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
- tag_similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and the target segments).
- email_similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.
- url_similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.
- number_similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.
- bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment in the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
- capital_letters_word_difference: The value of the feature is the ratio between the difference of the number of words containing at least a capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.
- only_capletters_dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.
- lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
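As an illustration, two of the features above can be sketched as follows. The punctuation vocabulary and the mismatch count for lang_dif are assumptions based on the feature descriptions, not the authors' exact implementation.

```python
import math
from collections import Counter

# Assumed punctuation vocabulary; the paper does not list the marks it counts.
PUNCT = list(".,;:!?()\"'")

def punct_vector(segment: str):
    """Count vector of punctuation marks over the assumed vocabulary."""
    counts = Counter(ch for ch in segment if ch in PUNCT)
    return [counts[p] for p in PUNCT]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source: str, target: str) -> float:
    return cosine(punct_vector(source), punct_vector(target))

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language codes,
    e.g. declared ('en', 'it') vs detected ('en', 'fr') -> 1."""
    return sum(1 for d, t in zip(declared, detected) if d != t)
```

The tag, email, URL and number similarity features follow the same cosine pattern over their respective token vectors.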
All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment to Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
- Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
- Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest is less likely to overfit the data than a decision tree.
- Logistic Regression: Logistic regression works particularly well when the classes are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
- Support Vector Machines with the linear kernel: Support vector machines are one of the most widely used classification algorithms.
- Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
- K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
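Since the paper names scikit-learn as the package used, the classifier comparison can be sketched as below. The hyperparameters are library defaults, an assumption, as the paper does not report the settings used.

```python
# Sketch of the classifier comparison from section 4.2; default
# hyperparameters are an assumption (the paper does not report them).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

def evaluate(X, y):
    """Mean three-fold F1 per classifier; for classifiers, cross_val_score
    uses stratified folds by default, matching the paper's evaluation."""
    return {name: cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
            for name, clf in CLASSIFIERS.items()}
```

With the 1243-example training set described above, `evaluate` would reproduce the three-fold stratified comparison reported in table 1.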
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
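The confusion counts reported above can be turned into the derived rates directly, which also checks the "around 10%" figure:

```python
# Confusion counts for the SVM on the 309-example test set, as reported.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)                       # 175/217, about 0.81
recall = tp / (tp + fn)                          # 175/207, about 0.85
f1 = 2 * precision * recall / (precision + recall)  # about 0.83
fn_rate_overall = fn / (tp + fp + fn + tn)       # 32/309, about 0.10
```

The overall false-negative rate of roughly 10% is exactly the fraction of good bi-segments that would be wrongly deleted from the database.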
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the MateCat set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
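The probability-ranking idea can be sketched as a threshold search on a validation set: keep lowering the decision threshold while the precision of the selected subset stays above the target. This is an illustrative procedure, not the authors' implementation.

```python
def threshold_for_precision(probs, labels, target_precision=0.9):
    """Smallest probability threshold t such that predicting 'negative'
    (bad bi-segment) only when prob >= t keeps precision at or above the
    target. probs: predicted probability of the negative class;
    labels: 1 if the example is truly negative. Returns None if no
    threshold reaches the target."""
    best = None
    for t in sorted(set(probs), reverse=True):
        selected = [label for p, label in zip(probs, labels) if p >= t]
        precision = sum(selected) / len(selected)
        if precision >= target_precision:
            best = t  # keep lowering t while precision stays high enough
        else:
            break
    return best
```

Only bi-segments scored above the returned threshold would be deleted, trading recall for the higher precision the conclusion asks for.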
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
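A minimal sketch of the dictionary-based alternative, assuming a word-level bilingual dictionary produced by an aligner: translate each source word by lookup and measure the overlap with the target segment. The Jaccard overlap is an illustrative choice; the paper only proposes the general idea.

```python
def dictionary_similarity(source: str, target: str, bilingual_dict: dict) -> float:
    """Jaccard-style overlap between the dictionary translations of the
    source words and the words of the target segment. Both the dictionary
    format and the overlap measure are illustrative assumptions."""
    translations = set()
    for word in source.lower().split():
        translations.update(bilingual_dict.get(word, []))
    target_words = set(target.lower().split())
    if not translations or not target_words:
        return 0.0
    return len(translations & target_words) / len(translations | target_words)
```

Because every lookup is an indexed dictionary access, this scales to the full database in a way a per-segment MT call does not.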
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
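Since the matching the paper criticises is Levenshtein-based, it can be made concrete with a minimal sketch of a fuzzy match score. The normalisation by the longer segment's length is one common convention; commercial TM tools do not publish their exact formulas.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match_score(segment: str, tm_segment: str) -> float:
    """Per-cent similarity in the style of TM fuzzy matching (an assumed
    normalisation, not any specific tool's formula)."""
    longest = max(len(segment), len(tm_segment))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(segment, tm_segment) / longest)
```

Under such a score, a TM segment sharing only one clause with the input falls far below typical fuzzy thresholds, which is the retrieval gap the paper targets.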
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore, clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007) we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a
group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting

Input:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM:
a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   23.33                  90.00
Retrieved (Minimum threshold)   38.00                  92.22

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   14.00                  88.89
Retrieved (Minimum threshold)   14.00                  96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is
still a noticeable increase in the percentage of matches retrieved.
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   2.67                   17.84
Retrieved (Minimum threshold)   10.00                  41.08

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   3.33                   25.95
Retrieved (Minimum threshold)   36.00                  70.67

Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   19.23                       85.31                      0.000
Minimum threshold   29.28                       86.49                      0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   11.78                       83.88                      0.000
Minimum threshold   11.78                       88.07                      0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.23                        17.21                      0.000
Minimum threshold   6.49                        30.62                      0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.71                        23.08                      0.000
Minimum threshold   16.83                       46.37                      0.000

Table 5: Paired t-test on all experiments
6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as they are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzik@itl.auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology supporting and facilitating the translation process. Such programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which prevents the retrieval of matches that are similar in meaning but worded differently. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) in all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue growing for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that were produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent.
The fuzzy match level reflects the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of natural language, the fuzzy match level for similar but not identical sentences is sometimes too low, and the translator is left confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.
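In schematic terms, the approach expands the TM on the source side only: every paraphrase of a source unit is mapped to the same, untouched human target translation. A minimal sketch of this idea follows; the hand-written paraphrase table is a stand-in for what the NooJ grammars generate, not the actual mechanism used in the paper:

```python
# Original TM: source segment -> human target translation.
tm = {
    "Press Cancel to cancel your personal information":
        "Premere Cancel per cancellare i propri dati personali",
}

# Hypothetical paraphrase generator stub; in the paper this role is
# played by NooJ grammars that rewrite support verb constructions.
def paraphrases(segment):
    table = {
        "Press Cancel to cancel your personal information":
            ["Press Cancel to make the cancellation of your personal information"],
    }
    return table.get(segment, [])

# Expand the TM on the source side only: each paraphrase is mapped
# to the same, unmodified target translation.
expanded_tm = dict(tm)
for src, tgt in tm.items():
    for p in paraphrases(src):
        expanded_tm[p] = tgt

# A sentence that had no exact match before is now retrieved.
query = "Press Cancel to make the cancellation of your personal information"
print(expanded_tm[query])
```

Because the target side is never modified, every retrieved suggestion remains a human translation.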
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy match scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on paraphrases in a database while matching. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment that can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, such hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997).

For instance, the word-based fuzzy match score between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
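The computation behind such scores can be sketched as follows: a word-based Levenshtein distance, with the fuzzy match score taken as one minus the distance divided by the length of the longer segment. This is one common formulation, not necessarily the exact formula of any commercial tool:

```python
def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming:
    # minimum number of insertions, deletions and substitutions
    # needed to turn sequence a into sequence b.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    # Word-based fuzzy match score: 1 - edit distance / length of
    # the longer segment (commercial CAT tools may use character-based
    # or proprietary variants and different normalizations).
    w1, w2 = s1.split(), s2.split()
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))

new = "Press Cancel to make the cancellation of your personal information"
tm  = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(new, tm), 2))  # prints 0.6
```

With this normalization, sentences (1) and (2) differ by 1 substitution and 3 deletions, giving a score of 60%; the paper's 70% figure follows from a different normalization of the same edit operations.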
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly carried by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: they may appear with an indirect object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences that have the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. Moreover, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
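These entries follow a simple pattern: a lemma, a category, and a list of +-separated properties, some carrying =-assigned values. A small parser illustrates how such lines can be read programmatically (the parsing code is our own sketch, not part of NooJ):

```python
def parse_nooj_entry(line):
    # An entry such as "artist,N+FLX=TABLE+Hum" consists of a lemma,
    # a category (part of speech) and +-separated properties, some of
    # which carry =-assigned values (e.g. the inflection paradigm FLX).
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    category, props = fields[0], {}
    for f in fields[1:]:
        key, _, value = f.partition("=")
        props[key] = value if value else True  # bare flags become True
    return {"lemma": lemma, "category": category, "properties": props}

entry = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
print(entry["lemma"], entry["category"], entry["properties"])
```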
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline extracts the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is used for de-tokenization in the last step.

All the source translation units then pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
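In spirit, the rewrite performed by the grammar can be sketched with a toy rule. This is a drastic simplification: the real grammar works over NooJ's morphological analysis and handles tense, person and derivational links between nouns and verbs, none of which the hand-written tables below attempt:

```python
# Toy support-verb-construction rewriter. The word lists and the
# nominalization table are illustrative stand-ins for NooJ's lexicon.
SUPPORT_VERBS = {"make", "makes", "take", "takes", "give", "gives", "have", "has"}
DETERMINERS = {"a", "an", "the"}
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve",
                   "decision": "decide", "installation": "install"}
PREPOSITIONS = {"of", "to"}

def paraphrase_svc(words):
    out, i = [], 0
    while i < len(words):
        w = words[i].lower()
        # Pattern: support verb (+ optional determiner) + nominalization
        # (+ optional preposition) -> the corresponding full verb.
        if w in SUPPORT_VERBS:
            j = i + 1
            if j < len(words) and words[j].lower() in DETERMINERS:
                j += 1
            if j < len(words) and words[j].lower() in NOMINALIZATIONS:
                verb = NOMINALIZATIONS[words[j].lower()]
                j += 1
                if j < len(words) and words[j].lower() in PREPOSITIONS:
                    j += 1
                out.append(verb)
                i = j
                continue
        out.append(words[i])
        i += 1
    return out

sent = "Press Cancel to make the cancellation of your personal information"
print(" ".join(paraphrase_svc(sent.split())))
```

Applied to sentence (1), the rule yields the wording of sentence (2), which is exactly the kind of convergence the grammar exploits.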
The last pipeline performs the de-tokenization and concatenates the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs whose source language is English. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
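Put together, the three stages amount to a loop over translation units in which only the source side is ever rewritten. A schematic sketch follows; the Berkeley tokenization steps and the NooJ paraphrasing step are reduced to a placeholder function:

```python
def expand_tm(units, paraphrase):
    # units: list of (source, target) translation units.
    # The target side is never touched; for every source unit that
    # yields a paraphrase, a new unit (paraphrase, original target)
    # is appended to the memory.
    expanded = list(units)
    for src, tgt in units:
        p = paraphrase(src)
        if p is not None and p != src:
            expanded.append((p, tgt))
    return expanded

# Placeholder for the NooJ step: rewrites one known SVC.
def toy_paraphrase(src):
    if "make the cancellation of" in src:
        return src.replace("make the cancellation of", "cancel")
    return None

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(len(expand_tm(tm, toy_paraphrase)))  # prints 2: original + paraphrased unit
```

Both units in the expanded memory point to the same human-produced Italian segment, which is what keeps the technique safe.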
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores when the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain, to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually, to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrased). The TM contains translations taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.

The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), though we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95-99% category, 6% in the 85-94% category, 28% in the 75-84% category and, finally, 27% in the 0-74% category (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
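One way to read Table 1 is as the share of the test set retrieved at a high fuzzy match level; this can be checked directly from the counts (our own arithmetic over the published figures, not a computation from the paper):

```python
# Counts from Table 1: number of test segments per fuzzy match category.
plTM  = {"100": 14, "95-99": 23, "85-94": 18, "75-84": 51, "50-74": 32, "no_match": 62}
parTM = {"100": 48, "95-99": 32, "85-94": 29, "75-84": 38, "50-74": 18, "no_match": 35}
TOTAL = 200

def share_high_value(counts):
    # Fraction of the test set retrieved at 85% fuzzy match or better.
    return (counts["100"] + counts["95-99"] + counts["85-94"]) / TOTAL

print(f"plTM:  {share_high_value(plTM):.1%}")   # prints plTM:  27.5%
print(f"parTM: {share_high_value(parTM):.1%}")  # prints parTM: 54.5%
```

In other words, the fraction of segments retrieved at 85% or better roughly doubles with the paraphrased TM.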
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future work.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same although the lexical items differ.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentence may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources, of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudiedwithin the field of NLP Machine translation tech-niques especially example-based machine trans-lation (EBMT) employ translation memories in anapproach similar to CAT systems (Planas and Fu-ruse 1999) but NLP approaches have not been ap-plied on them extensively
TM-related papers mainly focus on algorithmsfor searching matching and suggesting segments
31
CZ EN probabilities alignmentbyacutet vetšiacute be greater 0538 0053 0538 0136 0-0 1-1byacutet vetšiacute be larger 0170 0054 0019 0148 0-0 1-1
Figure 1 An example of generated subsegments ndash consistent phrases
within CAT systems (Planas and Furuse 2000)In (Deacutesilets et al 2008) the authors have
attempted to build translation memories fromWeb since they found that human translators inCanada use Google search results even more of-ten than specialized translation memories That iswhy they developed system WeBiText for extract-ing possible segments and their translations frombilingual web pages
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
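As a concrete illustration (our own sketch, not the authors' code), selecting among alternative translations of a subsegment from Moses-style phrase-table entries could look as follows; the `|||`-separated line format matches Moses phrase tables, while the scoring rule (product of the four probabilities) is our assumption:

```python
# Hypothetical sketch: pick the best translation for a subsegment from
# Moses-style phrase-table entries by maximising the product of the four
# translation probabilities (the selection rule is our assumption).

def parse_entry(line):
    """Split one 'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' line."""
    src, tgt, probs, align = [field.strip() for field in line.split("|||")]
    return src, tgt, [float(p) for p in probs.split()], align

def best_translation(entries):
    """Return the target phrase with the largest probability product."""
    scored = []
    for line in entries:
        _, tgt, probs, _ = parse_entry(line)
        score = probs[0] * probs[1] * probs[2] * probs[3]
        scored.append((score, tgt))
    return max(scored)[1]

# The two entries from Figure 1 (diacritics omitted for simplicity):
entries = [
    "byt vetsi ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1",
    "byt vetsi ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1",
]
print(best_translation(entries))  # be greater
```

On the Figure 1 example, "be greater" wins because its probability product is roughly two orders of magnitude larger.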
The alignment points determine the word alignment between a subsegment and its translation; i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
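These three situations can be detected mechanically from the alignment points. The following small function is our own reconstruction for illustration, not part of the described system:

```python
from collections import Counter

def analyse_alignment(points, src_len, tgt_len):
    """Classify an alignment string such as '0-0 1-1' (our reconstruction):
    returns (empty, one_to_many, opposite) flags, i.e. whether some word is
    unaligned, whether one source word links to several target words, and
    whether the alignment is non-monotonic (reordered)."""
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_aligned = {s for s, _ in pairs}
    tgt_aligned = {t for _, t in pairs}
    empty = len(src_aligned) < src_len or len(tgt_aligned) < tgt_len
    one_to_many = any(c > 1 for c in Counter(s for s, _ in pairs).values())
    tgt_seq = [t for _, t in sorted(pairs)]  # target order when read in source order
    opposite = tgt_seq != sorted(tgt_seq)
    return empty, one_to_many, opposite

print(analyse_alignment("0-0 1-1", 2, 2))  # (False, False, False)
```

For instance, "0-1 1-0" would be flagged as opposite orientation, and "0-0 0-1" as one-to-many.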
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained from a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
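To make the fluency filtering concrete, here is a toy stand-in for the combined n-gram score over n = 1..5; the real system queries a KenLM model trained on enTenTen, so the relative-frequency scoring below is purely illustrative:

```python
from collections import Counter

def ngram_score(tokens, counts, n_max=5):
    """Toy combined n-gram fluency score for n = 1..n_max: average relative
    frequency of the candidate's n-grams in a corpus count table. Only a
    stand-in for the KenLM language model used in the paper."""
    total = sum(counts.values()) or 1
    per_n = []
    for n in range(1, n_max + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if grams:
            per_n.append(sum(counts.get(g, 0) for g in grams) / (len(grams) * total))
    return sum(per_n) / len(per_n)

# Count all n-grams (n = 1..5) of a tiny "corpus":
corpus = "a table by the window".split()
counts = Counter(tuple(corpus[i:i + n])
                 for n in range(1, 6) for i in range(len(corpus) - n + 1))
# The fluent order scores higher than the scrambled one:
print(ngram_score("a table".split(), counts) > ngram_score("table a".split(), counts))  # True
```

A candidate combination whose score falls below a threshold would be rejected, which is exactly how the non-overlapping combinations are ordered here.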
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
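The longest-first joining strategy can be sketched over token spans. This is a simplified reconstruction of the described algorithm (it merges overlapping spans, as in JOINO, and adjacent spans, as in JOINN, but omits the temporary-list bookkeeping):

```python
def join_subsegments(spans):
    """Simplified longest-first JOIN over (start, end) token spans of
    subsegments found in a segment (end exclusive; list must be non-empty).
    Spans that overlap (JOINO) or are directly adjacent (JOINN) to the span
    built so far are merged into it."""
    remaining = sorted(spans, key=lambda s: s[0] - s[1])  # longest first
    current = remaining.pop(0)
    merged_any = True
    while merged_any:
        merged_any = False
        for span in list(remaining):
            # overlap or adjacency: [0,3) and [2,5) overlap; [0,5) and [5,7) touch
            if span[0] <= current[1] and current[0] <= span[1]:
                current = (min(current[0], span[0]), max(current[1], span[1]))
                remaining.remove(span)
                merged_any = True
    return current

print(join_subsegments([(0, 3), (2, 5), (5, 7)]))  # (0, 7)
```

Starting from the longest span and greedily absorbing compatible neighbours mirrors the preference for joining longer subsegments first.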
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg.:     Oblast dat může mít libovolný tvar
reference:       The data area may have an arbitrary shape
generated seg.:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central-European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language-pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there have been many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors to use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
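As an illustration of the ranking step, a word-level edit rate can serve as a lightweight stand-in for TER (real TER additionally allows shift operations and is computed with the tercom package); this sketch is ours, not CATaLog code:

```python
def edit_rate(hyp, ref):
    """Word-level Levenshtein distance normalised by hypothesis length; a
    simplified stand-in for TER, which additionally allows shifts."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(h), 1)

def top_matches(query, tm_sources, k=5):
    """Rank TM source segments by ascending edit rate; keep the k best."""
    return sorted(tm_sources, key=lambda s: edit_rate(query, s))[:k]

tm = ["we want to have a table near the window",
      "you are wrong",
      "how much does it cost"]
print(top_matches("we would like a table by the window", tm, k=1))
```

Lower edit rate means higher similarity, so sorting in ascending order yields the top candidates.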
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0]
["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0]
["window","window",C,0] [".",".",C,0]

we want to have a table near the window .
|  D    S  S    | |     S    |   |      |
we -    would like a table by the window .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
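The effect of a reduced deletion weight can be shown on this very example; the concrete weight values below are our assumptions for illustration only:

```python
def weighted_edit_score(ops, w_ins=1.0, w_del=0.1, w_sub=1.0, w_shift=1.0):
    """Hypothetical re-scoring of a TER edit sequence with a reduced deletion
    weight; ops is a string over C/I/D/S/T (match, insert, delete, substitute,
    shift). The concrete weight values are assumptions. Lower is better."""
    weights = {"C": 0.0, "I": w_ins, "D": w_del, "S": w_sub, "T": w_shift}
    return sum(weights[op] for op in ops)

tm1 = "CCCCC" + "D" * 6  # TM segment 1: 5 matches, 6 deletions
tm2 = "CC" + "I" * 3     # TM segment 2: 2 matches, 3 insertions
# Equal weights prefer TM segment 2 (cost 3 vs. 6) ...
print(weighted_edit_score(tm1, w_del=1.0) > weighted_edit_score(tm2, w_del=1.0))  # True
# ... but a cheap deletion prefers TM segment 1 (cost 0.6 vs. 3.0):
print(weighted_edit_score(tm1) < weighted_edit_score(tm2))  # True
```

With deletions at a tenth of the cost, the longer-but-deletable TM segment 1 overtakes TM segment 2, exactly as argued in the text.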
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments green and the unmatched fragments red, in the selected source sentences and their corresponding translations.
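The propagation of matches through the two alignments can be sketched as follows; the function and its data layout are our reconstruction of the described logic, not CATaLog code:

```python
def colour_target(ter_matched_src, src2tgt, tgt_len):
    """Our reconstruction of the colour propagation: ter_matched_src is the
    set of TM-source token positions that matched the input (from the TER
    alignment); src2tgt maps each source position to its target positions
    (from GIZA++). Target tokens linked to a matched source token become
    green, all remaining target tokens red."""
    colours = ["red"] * tgt_len
    for s, targets in src2tgt.items():
        if s in ter_matched_src:
            for t in targets:
                colours[t] = "green"
    return colours

# Toy example: source tokens 0 and 2 matched the input sentence.
print(colour_target({0, 2}, {0: [0], 1: [2], 2: [1]}, 3))  # ['green', 'green', 'red']
```

The same marking applied to the source side uses the TER alignment alone, so both sides can be rendered consistently.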
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost for insertion than deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
39
5 you're overcharging me
Target Matches
1 আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপনি ভুল (Gloss: apni vul)

4 আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)

5 আপনি আমার কাছ থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that we consider a match only if they appear in the same form as in some input sentence. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to enable or disable these posting lists, which makes it possible to compare results with and without them. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
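The posting-list filtering described above can be sketched as follows. This is an illustrative reconstruction, not CATaLog's code; the stop-word list is a placeholder:

```python
# Sketch of posting-list (inverted-index) candidate filtering:
# map each lowercased, unstemmed content word to the TM segments
# containing it, then score only segments sharing a vocabulary word
# with the input sentence.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "to", "of"}  # placeholder list

def build_posting_lists(tm_source_segments):
    """Map each content word to the set of TM segment ids containing it."""
    postings = defaultdict(set)
    for seg_id, segment in enumerate(tm_source_segments):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(seg_id)
    return postings

def candidate_segments(input_sentence, postings):
    """Only these candidates are passed on to expensive TER scoring."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= postings.get(word, set())
    return candidates

tm = ["you gave me the wrong change",
      "i think you got the wrong number",
      "the train leaves at noon"]
postings = build_posting_lists(tm)
cands = candidate_segments("you gave me wrong number", postings)
# the third segment shares no vocabulary word and is never TER-aligned
```

Only the first two segments survive the filter here; the expensive TER alignment is then computed for those alone.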
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the calculation of similarity scores to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.
Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3.º I, 28230 Las Rozas, Madrid, Spain
carla.parra@hermestrans.com
Abstract
This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.
1 Introduction
Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.
Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (AutoSuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.
1 www.sdl.com
2 www.memoq.com
In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to carry out the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise, and proofread the translations prior to final delivery.
Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite the positive results reported, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2; Section 3 summarises the survey we carried out and the replies we received; Section 4 explains how we implemented one of the suggestions, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs; Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach; finally, Section 6 summarises our work and discusses future work.
2 Our company
Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium, and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (the UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which functionality was, according to them, missing as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all of them had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically, and it is difficult to estimate
3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings dialog box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html#repair_resource3.html
when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an under-usage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of this functionality: fragment assembly.6
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5 For SDL Studio, there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated into the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6 For further information, see http://kilgray.com/memoq/2015/help-en/index.html#fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source-sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM.

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving such translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project and thus not have seen fragments of a segment that other translators had already translated.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons

One possible type of segment would be one in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear.
(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons, in which only the fragment before or after the colon appears.
(7) Installing new services
In these cases, we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
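The procedure above might be sketched as follows. This is an illustrative reconstruction under stated assumptions (the marker must appear strictly inside both source and target; whitespace handling is our own choice), not the authors' exact script:

```python
# Sketch of the ellipsis/colon splitting step (Section 4.1):
# split a bilingual segment pair at a marker shared by source and target.
def split_on_marker(source, target, markers=(":", "...", "…")):
    """Return new (source, target) fragment pairs, or [] if no shared
    mid-segment marker is found."""
    for m in markers:
        s_idx, t_idx = source.find(m), target.find(m)
        # require the marker strictly inside both segments
        if 0 < s_idx < len(source) - len(m) and 0 < t_idx < len(target) - len(m):
            return [
                (source[:s_idx].strip(), target[:t_idx].strip()),
                (source[s_idx + len(m):].strip(), target[t_idx + len(m):].strip()),
            ]
    return []

pairs = split_on_marker(
    "The following message then appears: Click accept to run the program.",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.")
```

Applied to Example 1 above, this yields the two fragment pairs that later recombine into the translation of Example 3.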
4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets, or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store both the translation of the fragment between the aforementioned characters and the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier).
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes).

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets, or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) the sentence with those characters and the content between them removed;

(b) a fragment starting at the opening character and finishing at the closing one, including the content within (in this fragment the parentheses, square brackets, or curly braces are maintained);

(c) a fragment containing only the content within those characters (parentheses, square brackets, or curly braces), but without them.

3. Create a new TM segment for each fragment.
At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets, or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
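A minimal sketch of the three-fragment extraction in 4.2, under the same order-preservation assumption stated above; the regular expressions and pairing-by-position are our simplifications, not the authors' exact implementation:

```python
# Sketch of Section 4.2: for each bracket type, keep (a) the sentence
# without the bracketed material, (b) each bracketed fragment with its
# brackets, and (c) the bare content. Source/target fragments are paired
# by position, assuming brackets appear in the same order in both languages.
import re

BRACKETS = {
    "(": (r"\(([^()]*)\)", ")"),
    "[": (r"\[([^\[\]]*)\]", "]"),
    "{": (r"\{([^{}]*)\}", "}"),
}

def bracket_fragments(source, target):
    new_segments = []
    for opener, (pattern, closer) in BRACKETS.items():
        s_hits = re.findall(pattern, source)
        t_hits = re.findall(pattern, target)
        if s_hits and len(s_hits) == len(t_hits):
            # (a) the sentence without the bracketed material
            new_segments.append((re.sub(pattern, "", source).strip(),
                                 re.sub(pattern, "", target).strip()))
            for s_in, t_in in zip(s_hits, t_hits):
                # (b) the bracketed fragment, brackets included
                new_segments.append((opener + s_in + closer,
                                     opener + t_in + closer))
                # (c) the bare content, brackets removed
                new_segments.append((s_in, t_in))
    return new_segments

segs = bracket_fragments(
    "Creates an installation package (if it was not created earlier)",
    "Crea un paquete de instalación (si no se creó antes)")
```

For Example 8 this produces three new segment pairs, matching fragments (a), (b), and (c) of the strategy above.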
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segment.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
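The quotation/tag handling might be sketched as follows. The numbered-tag pattern follows the paired `<n>…<n>` form shown in Example 12; the actual tool's tag and quote handling may differ:

```python
# Sketch of Section 4.3: keep (i) the sentence with quote/tag markup
# stripped but the inner text left in place, and (ii) each quoted or
# tagged fragment as a new segment of its own.
import re

TAG_RE = re.compile(r"<(\d+)>(.*?)<\1>")           # paired numbered tags
QUOTE_RE = re.compile(r"\"([^\"]*)\"|“([^”]*)”")   # straight or curly quotes

def quote_and_tag_segments(sentence):
    new_segments = []
    # the sentence with markup removed (tagged/quoted text kept in place)
    stripped = TAG_RE.sub(lambda m: m.group(2), sentence)
    stripped = QUOTE_RE.sub(
        lambda m: m.group(1) if m.group(1) is not None else m.group(2),
        stripped)
    if stripped != sentence:
        new_segments.append(stripped)
    # the text inside each pair of tags or quotes, as its own segment
    new_segments += [m.group(2) for m in TAG_RE.finditer(sentence)]
    new_segments += [m.group(1) if m.group(1) is not None else m.group(2)
                     for m in QUOTE_RE.finditer(sentence)]
    return new_segments

segs = quote_and_tag_segments(
    "If you clear the <1>Inherit settings<1> check box in the <2>General<2> section")
```

On a shortened version of Example 12 this yields the de-tagged sentence plus one segment per tagged fragment; the same pairing-by-position caveat from 4.2 would apply when aligning the source and target fragments.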
5 Pilot test
Before integrating our system into a CAT tool and into our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients: a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions   1,064         80
100%              0          0
95–99%            4          2
85–94%            0          0
75–84%          285         14
50–74%        2,523        187
No match      2,404        142
Total         6,280        425

Table 1: Project statistics according to memoQ, using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (project TM) was the project TM provided by the client. The second TM (product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the sizes of the three TMs.
             Segments   EN words   ES words   Total words
Project TM     16,842    212,472    244,159       456,631
Product TM     20,923    274,542    317,797       592,339
Client TM     256,099  3,427,861  3,951,732     7,379,593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.
Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM
                 Segments   EN words   ES words   Total words
New project TM      6,776     56,973     66,297       123,270
New product TM      7,760     71,034     83,125       154,159
New client TM      74,041    662,714    769,705     1,432,419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
Type              Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                  1,801        2,094      17,894
Parentheses            1,361        1,621      29,637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations             1,085        1,146       8,797
Tags                   2,523        2,892      17,792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: the project TM as provided by the client.

2. TM2: only the new segments generated from the project TM provided by the client.

3. TM3: a combination of the project TM and the new segments retrieved from it.

4. TM4: only the new segments generated from the product TM.

5. TM5: the project TM and the new segments generated from the product TM.

6. TM6: the project TM combined with the new-segment TMs generated from the project TM and the ones from the product TM.

7. TM7: only the new segments generated from the client TM.

8. TM8: the project TM combined with the new segments generated from the client TM.

9. TM9: the project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: the project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: the project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM gen-
6
                TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started     234  204  150  202  143  143   38   27   26    26    26
Pre-translated  155  111  178  108  179  183  139  199  205   201   205
Fragments        36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.
            TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%          0    5    5    5    5    5    5    5    5     5     5
95–99%        2    3    3    4    4    4    9    9    9     9     9
85–94%        0    1    1    1    1    1    2    2    2     2     2
75–84%       14    9   14    9   15   15   12   16   16    16    16
50–74%      187  130  198  136  202  202  180  223  226   226   226
No match    142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
erated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
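This kind of naive, language-independent fragment generation can be illustrated by splitting a segment at punctuation marks and inline tags. A sketch only: the delimiter set and function name are assumptions, not the implementation used in the pilot test.

```python
import re

# Split a segment into fragments at punctuation marks and inline tags.
# The delimiter set below is an illustrative assumption.
DELIMITERS = re.compile(r"[.,;:!?]|<[^>]+>")

def fragments(segment: str) -> list[str]:
    parts = DELIMITERS.split(segment)
    return [p.strip() for p in parts if p.strip()]

print(fragments("Press <b>Start</b>, then wait; the device reboots."))
# ['Press', 'Start', 'then wait', 'the device reboots']
```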
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned combining already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained using the first method is high, and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention ApSIC Xbench2. ApSIC Xbench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French); the aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. Language Errors and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
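Equation 1 is straightforward to compute. A minimal sketch follows; the function name and the use of character-based lengths are assumptions for illustration:

```python
import math

def church_gale(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011).

    Segment lengths are measured in characters here, which is an
    assumption made for this sketch.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

True translations cluster around 0; strongly mismatched lengths push the score away from the center in either direction.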
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center; therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool available at https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set
Figure 3: The distribution of Church-Gale scores in the Collaborative Set
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores are away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: contains 1,243 bi-segments, of which 373 are negative examples.
• Test Set: contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
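The biased sampling step described above can be sketched by weighting candidate bi-segments by the magnitude of their Church-Gale score, so that segments far from the center of the distribution are drawn more often. The weighting scheme itself is an assumption, not the authors' exact method:

```python
import random

# Sketch: draw k candidate negatives, favouring bi-segments whose
# Church-Gale score lies far from the center of the distribution.
def sample_candidates(bisegments, scores, k, seed=0):
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in scores]  # avoid zero weights
    return rng.choices(bisegments, weights=weights, k=k)
```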
Figure 5: The Q-Q plot for the Matecat set
Figure 6: The Q-Q plot for the Collaborative set
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1: it has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, and 0 otherwise.
• has tag. The value of the feature is 1 if the source or target segments contain a tag, and 0 otherwise.
• has email. The value of the feature is 1 if the source or target segments contain an email address, and 0 otherwise.
• has number. The value of the feature is 1 if the source or target segments contain a number, and 0 otherwise.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, and 0 otherwise.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, and 0 otherwise. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in the source and target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
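The lang dif feature described in the last bullet is just a count of mismatches between declared and detected language codes. A sketch (the function name is an assumption):

```python
def lang_dif(declared: tuple[str, str], detected: tuple[str, str]) -> int:
    # Count how many of the two language codes (source, target) disagree
    # between what the contributor declared and what a detector reports.
    return sum(1 for dec, det in zip(declared, detected) if dec != det)

print(lang_dif(("en", "it"), ("en", "it")))  # 0
print(lang_dif(("en", "it"), ("en", "fr")))  # 1
print(lang_dif(("en", "it"), ("de", "fr")))  # 2
```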
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
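The cosine-similarity features above (punctuation, tag, email, URL, number vectors) all follow the same pattern; here is a sketch for the punctuation case, where the punctuation set and function names are assumptions:

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()-"

def punct_vector(segment: str) -> Counter:
    # Count occurrences of each punctuation mark in the segment.
    return Counter(ch for ch in segment if ch in PUNCT)

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(punct_vector("Hello, world!"), punct_vector("Ciao, mondo!")))
```

Identical punctuation profiles yield a similarity of 1; segments with no punctuation marks in common yield 0.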
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).
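The three-fold stratified protocol, including the two dummy baselines, can be reproduced with scikit-learn along the following lines. This is a sketch on synthetic data: the feature matrix, the model subset, and the hyperparameters are assumptions, not the paper's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real training matrix
# (1243 bi-segments, roughly 30% negatives).
X, y = make_classification(n_samples=1243, weights=[0.7], random_state=0)

models = {
    "SVM (linear kernel)": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {f1.mean():.2f}")
```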
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, twice lower than the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets; obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
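The figures quoted above follow directly from the confusion counts. A sketch of the arithmetic (note that the published table reports slightly different, possibly averaged, values):

```python
tp, fp, fn, tn = 175, 42, 32, 60  # SVM confusion counts on the test set

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_negative_share = fn / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"share of examples lost as false negatives: {false_negative_share:.1%}")
```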
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the MateCat set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
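This high-precision-subset idea can be sketched with scikit-learn's `predict_proba`: act only on bi-segments the model is very confident about and abstain on the rest. Synthetic data; the 0.9/0.1 cutoffs are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bi-segment feature matrix.
X, y = make_classification(n_samples=1500, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Keep only confident predictions; abstain on the rest.
confident = (proba >= 0.9) | (proba <= 0.1)
labels = (proba >= 0.9).astype(int)

accuracy = (labels[confident] == y_te[confident]).mean()
print(f"coverage = {confident.mean():.0%}, accuracy on confident subset = {accuracy:.2f}")
```

The trade-off is coverage: the stricter the cutoff, the fewer bi-segments the cleaner acts on, but the safer each deletion becomes.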
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
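The sentence-level, Levenshtein-based matching described above can be made concrete with a minimal sketch. The normalisation by the longer segment's length is one commonly described formulation; commercial tools do not disclose their exact scoring, so this is an illustrative approximation rather than any particular tool's algorithm.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences
    (deletions, insertions, substitutions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[-1][-1]

def fuzzy_match(seg1, seg2):
    """Word-level similarity in [0, 1], normalised by the longer segment."""
    w1, w2 = seg1.lower().split(), seg2.lower().split()
    return 1.0 - levenshtein(w1, w2) / max(len(w1), len(w2))

print(fuzzy_match("the cat sat on the mat", "the cat sat on a mat"))  # ~0.83
```

A single differing word in a six-word sentence already drops the score to roughly 0.83, which illustrates why a shared clause buried in a longer, otherwise different sentence scores far too low to be retrieved.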
In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore, clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2, we discuss related work on TM matching. In section 3, we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke (2013)).
MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then, a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
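The effect can be sketched numerically with the schematic sentences above. Here difflib's sequence matcher stands in for the (undisclosed) Levenshtein-based scorers of commercial tools; the exact numbers would differ per tool, but the contrast between whole-sentence and clause-level scores is the point.

```python
from difflib import SequenceMatcher

def similarity(seg1, seg2):
    """Word-level similarity in [0, 1]; difflib stands in for the
    Levenshtein-based scorers of commercial TM tools."""
    return SequenceMatcher(None, seg1.split(), seg2.split()).ratio()

# Schematic sentences from the example above: input (a) has 20 words,
# TM sentence (b) shares only the first clause, w1..w5.
a = " ".join(f"w{i}" for i in range(1, 21))
b = " ".join(f"w{i}" for i in [*range(1, 6), *range(21, 36)])
clause_a = " ".join(f"w{i}" for i in range(1, 6))

print(similarity(a, b))                # 0.25 -- below any usable threshold
print(similarity(clause_a, clause_a))  # 1.0  -- the clause alone is an exact match
```

Scored whole-against-whole, the shared clause contributes only a 0.25 similarity; split off as its own segment, it becomes an exact match.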
In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
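The pre-processing step can be sketched as follows. The splitter below is a hypothetical regex stand-in for the rule-based splitter of Puscasu (2004), which is not publicly packaged; a real finite-clause splitter checks each candidate for a finite verb rather than splitting on a fixed word list. Only the expand-the-TM-into-clauses step mirrors the setup described above.

```python
import re

# Hypothetical stand-in for a rule-based clause splitter: break before a
# coordinating conjunction or subordinator (optionally preceded by a comma).
SPLIT_RE = re.compile(r",?\s+(?=(?:and|but|that|which|because|when)\b)",
                      re.IGNORECASE)

def split_clauses(segment):
    """Split one segment into candidate clauses."""
    return [c.strip(" ,") for c in SPLIT_RE.split(segment) if c.strip(" ,")]

def preprocess_tm(tm_segments):
    """Expand each TM segment into clause-level segments before import."""
    out = []
    for seg in tm_segments:
        out.extend(split_clauses(seg))
    return out
```

Running `preprocess_tm` over the TM before import turns each multi-clause sentence into several shorter segments, each of which can then be matched against a clause-level input on its own.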
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
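Threshold-based retrieval itself is straightforward to sketch. The scorer below is a crude word-overlap ratio, not the undisclosed scoring of Wordfast or Trados, and the segments are illustrative; the point is only how the threshold gates which TM entries reach the translator.

```python
def retrieve(input_seg, tm, threshold=0.70):
    """Return (source, target, score) for TM entries whose source side
    scores at or above the threshold. The scorer is a crude word-overlap
    ratio standing in for the tools' Levenshtein-based scoring."""
    def score(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa), len(wb))
    return [(src, tgt, score(input_seg, src))
            for src, tgt in tm if score(input_seg, src) >= threshold]

# English-English TM, as in the experiments described here.
tm = [("southwark crown court was told yesterday",
       "southwark crown court was told yesterday")]
print(retrieve("southwark crown court was told yesterday", tm, 0.70))
```

Lowering `threshold` from the default to the minimum admits weaker candidates, which is exactly the trade-off discussed above.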
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the TM stores the component clauses of the original longer segment; we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.
  Without clause splitting
    Input:         the ministry of defense once indicated that about 20000
                   soldiers were missing in the korean war
    Segment in TM: the ministry of defense once indicated that about 20000
                   soldiers were missing in the korean war and that the
                   ministry of defense believes there may still be some
                   survivors

  With clause splitting
    Input:          - the ministry of defense once indicated
                    - that about 20000 soldiers were missing in the korean war
    Segments in TM: - the ministry of defense once indicated
                    - that about 20000 soldiers were missing in the korean war
                    - and that the ministry of defense believes
                    - there may still be some survivors

Table 1: Set A example

  Without clause splitting
    Input:         a member of the chart-topping collective so solid crew
                   dumped a loaded pistol in an alleyway
    Segment in TM: a member of the rap group so solid crew threw away a
                   loaded gun during a police chase southwark crown court
                   was told yesterday

  With clause splitting
    Input:          a member of the chart-topping collective so solid crew
                    dumped a loaded pistol in an alleyway
    Segments in TM: - a member of the rap group so solid crew threw away a
                      loaded gun during a police chase
                    - southwark crown court was told yesterday

Table 2: Set B example
5 Results
  WORDFAST                         W/o clause splitting   W/ clause splitting
  Retrieved (default threshold)    23.33%                 90.00%
  Retrieved (minimum threshold)    38.00%                 92.22%

  TRADOS                           W/o clause splitting   W/ clause splitting
  Retrieved (default threshold)    14.00%                 88.89%
  Retrieved (minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados, around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
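The test itself takes only a few lines to reproduce. The scores below are invented for illustration (they are NOT the paper's data), and the paper used SPSS; this sketch hand-rolls the paired t statistic from per-segment match scores, with 0 standing for "no correct match retrieved".

```python
import math
from statistics import mean, stdev

def paired_t(with_scores, without_scores):
    """Paired t statistic over per-segment match scores."""
    d = [a - b for a, b in zip(with_scores, without_scores)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative scores only: five test cases, with the average clause score
# used for multi-clause inputs, as described above.
t = paired_t([88, 100, 95, 72, 100], [0, 0, 76, 0, 82])
print(round(t, 2))  # 3.44
```

The p-value then follows from the t distribution with n - 1 degrees of freedom (e.g. via scipy.stats.ttest_rel on the two score lists).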
SET A

  WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
  Default threshold    19.23                       85.30888891                0.000
  Minimum threshold    29.28                       86.48888891                0.000

  TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
  Default threshold    11.78                       83.87777781                0.000
  Minimum threshold    11.78                       88.07111113                0.000

SET B

  WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
  Default threshold    2.23                        17.21111111                0.000
  Minimum threshold    6.49                        30.62111112                0.000

  TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
  Default threshold    2.71                        23.083                     0.000
  Minimum threshold    16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

  WORDFAST                         W/o clause splitting   W/ clause splitting
  Retrieved (default threshold)    2.67%                  17.84%
  Retrieved (minimum threshold)    10.00%                 41.08%

  TRADOS                           W/o clause splitting   W/ clause splitting
  Retrieved (default threshold)    3.33%                  25.95%
  Retrieved (minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Puscasu G (2004) A Multilingual Method
for Clause Splitting In Proceedings of the
7th Annual Colloquium for the UK Special
Interest Group for Computational
Linguistics
Reinke U (2013) State of the Art in
Translation Memory Technology
Translation Computation Corpora
Cognition 3(1)
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing retrieval of matches that are similar. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by human translators in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses the past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains as is. To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
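The arithmetic can be checked with a short script. Note that normalising the word-level distance by the longer segment's length (one common variant; the exact formulas of commercial tools are not published) yields 60% for sentences (1) and (2), while the 70% figure quoted above follows the paper's own counting over 13 words.

```python
def word_edit_distance(s1, s2):
    """Word-level Levenshtein distance."""
    a, b = s1.lower().split(), s2.lower().split()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
dist = word_edit_distance(s1, s2)   # 1 substitution + 3 deletions = 4
score = 1 - dist / max(len(s1.split()), len(s2.split()))
print(dist, round(score, 2))  # 4 0.6
```

Either way of counting, the score falls well below the default fuzzy thresholds, which is the paper's point: the two sentences mean the same thing but surface-level edit distance cannot see it.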
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except in some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Other than fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrase of the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ
(Silberztein 2003) NooJ is a linguistic develop-
ment environment used to construct large-cover-
age formalized descriptions of natural languages
and apply them to large corpora in real time The
26
module consists of a very large lexicon along
with a large set of local grammars to recognize
named entities as well as unknown words word
sequences etc These resources have been ob-
tained from OpenLogos an old open source rule-
based MT system (Scott and Barreiro 2009) In
NooJ an electronic dictionary contains the lem-
mas with a set of information such as the cate-
gorypart-of-speech (eg V for verbs A for adjec-
tives etc) one or more inflectional andor deriva-
tional paradigms (eg how to conjugate verbs
how to lemmatize or nominalize them etc) one or
more syntactic properties (eg +transitiv for tran-
sitive verbs or +PREPin etc) one or more seman-
tic properties (eg distributional classes such as
+Human domain classes such as +Politics) and
finally one or more equivalent translations
(+IT=ldquotranslation equivalentrdquo) Figure 1 illus-
trates typical dictionary entries
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum
Figure 1: Dictionary entries in NooJ for nouns.
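As an aside, entries in this comma-separated lemma format are easy to process programmatically. The parser below is our own minimal sketch for illustration; it is not part of NooJ and ignores NooJ's richer dictionary syntax:

```python
def parse_entry(line):
    """Split an entry like 'artist,N+FLX=TABLE+Hum' into its lemma,
    category, and a dict of features (key=value pairs or bare flags)."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category, features = parts[0], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            features[key] = value
        else:
            features[part] = True  # bare flag such as +Hum or +Conc
    return lemma, category, features

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# ('artist', 'N', {'FLX': 'TABLE', 'Hum': True})
```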
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The process, summarized in Figure 2, is performed in three pipeline stages.
Figure 2: Pipeline of the paraphrase framework.
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on the specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective, or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
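To make the idea concrete, a toy paraphraser in the spirit of this grammar can be written with a regular expression and a small hand-made nominalization lexicon. This sketch is ours, not NooJ's formalism: it covers only a few support verbs, ignores tense and person agreement, and the lexicon entries are illustrative assumptions.

```python
import re

# Toy lexicon: predicate noun -> derived full verb (illustrative only).
NOMINALIZATIONS = {"decision": "decide", "installation": "install",
                   "presentation": "present"}
SUPPORT_VERBS = {"make", "makes", "take", "takes", "give", "gives"}

# Support verb + determiner + predicate noun (+ optional "of").
SVC = re.compile(r"\b(\w+) (?:a|an|the) (\w+)(?: of)?\b")

def paraphrase(sentence):
    """Replace recognized support verb constructions with the full verb."""
    def repl(match):
        verb, noun = match.group(1), match.group(2)
        if verb.lower() in SUPPORT_VERBS and noun.lower() in NOMINALIZATIONS:
            return NOMINALIZATIONS[noun.lower()]
        return match.group(0)  # not an SVC: leave the text untouched
    return SVC.sub(repl, sentence)

print(paraphrase("You must make the installation of the software"))
# You must install the software
```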
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.
The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%). However, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95–99%, 6% in category 85–94%, 28% in category 75–84%, and finally 27% in category 0–74% (No match + 50–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for the experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear, and less post-editing is required when the match is not equal to 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or a complex syntax. This is one of our approach's weaknesses and will be addressed in future work.

Seg:      Make sure that the brake hose is not twisted.
TMsl:     Ensure that the brake hose is not twisted.
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato.
parTMsl:  Make sure that the brake hose is not twisted.

Seg:      CAUTION: You must make the installation of the version 6 of the software.
TMsl:     CAUTION: You must install the version 6 of the software.
TMtg:     ATTENZIONE: Si deve installare la versione 6 del software.
parTMsl:  CAUTION: You must make the installation of the version 6 of the software.

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.
The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types, and language pairs, which showed that it is very effective. The results show that for all fuzzy match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs and the support of other languages as well. Last but not least, applying the paraphrase framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern, and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating N-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. The coverage of TMs translates directly into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time they try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching, and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final. The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability, and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word "být" in the source language is translated to the first word in the translation, "be", and the second word "větší" to the second, "greater". These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
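These records can be read with a few lines of code. The sketch below parses one subsegment entry in the Moses phrase-table style (`|||`-separated fields); picking the direct phrase translation probability as a ranking key is our own illustrative choice, not necessarily the authors' selection rule.

```python
def parse_subsegment(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| 0-0 1-1' into its fields.
    The four probabilities are, in order: inverse phrase translation
    probability, inverse lexical weighting, direct phrase translation
    probability, and direct lexical weighting."""
    src, tgt, probs, align = [field.strip() for field in line.split("|||")]
    probs = [float(p) for p in probs.split()]
    align = [tuple(int(i) for i in point.split("-")) for point in align.split()]
    return src, tgt, probs, align

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, probs, align = parse_subsegment(entry)
print(tgt, probs[2], align)
# be greater 0.538 [(0, 0), (1, 1)]
```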
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:      "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
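The core of this procedure is the test of whether two subsegment spans can be joined at all. The sketch below is our own reading of the description above (with end-exclusive token spans): it merges overlapping spans (JOINO) and adjacent spans (JOINN), and applies the longest-first preference greedily.

```python
def try_join(a, b):
    """Merge two (start, end) token spans, end-exclusive.  Returns the
    merged span when they overlap (JOINO) or touch (JOINN), else None."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 <= e1:           # s2 == e1: neighbours; s2 < e1: overlap
        return (s1, max(e1, e2))
    return None

def greedy_cover(spans):
    """Repeatedly merge spans, preferring the longest ones first."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged = True
    while merged and len(spans) > 1:
        merged = False
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                joined = try_join(spans[i], spans[j])
                if joined is not None:
                    rest = [s for k, s in enumerate(spans) if k not in (i, j)]
                    spans = sorted([joined] + rest,
                                   key=lambda s: s[1] - s[0], reverse=True)
                    merged = True
                    break
            if merged:
                break
    return spans

print(greedy_cover([(0, 3), (3, 5), (7, 9)]))
# [(0, 5), (7, 9)]
```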
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of the segments from the input document by the segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment then produce matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM     SUB     OJ     NJ    all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16  10.05  25.09  40.95  42.58
any     10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ     NJ    all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%  10.24  27.77  41.90  44.47
all     11.58  29.32  46.66  49.87
For a comparison we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ     NJ    all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%  12.15  34.84  49.82  51.62
all     14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory
feature    SUB    OJ     NJ     NS
prec      0.60  0.63   0.70   0.66
recall    0.67  0.74   0.74   0.71
F1        0.64  0.68   0.72   0.68
METEOR    0.31  0.37   0.38   0.38

DGT
prec      0.76  0.93   0.91   0.81
recall    0.78  0.86   0.88   0.85
F1        0.77  0.89   0.89   0.83
METEOR    0.40  0.50   0.51   0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

Table 6: Error example, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Min-istry of Education of CR within the LINDAT-Clarin project LM2010013 and by the GrantAgency of CR within the project 15-13277S
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in
1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable to use for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments, and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress.
of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.2.55) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.2.5 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window .

TM Match: we want to have a table near the window .
TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we want to   have a table near the window .
|  D    S    S    |   |    S    |   |      |
we -    would like a table by   the window .
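As a rough illustration of how such an edit rate is computed, the sketch below implements a plain word-level edit distance (insertions, deletions, and substitutions only; unlike tercom, it does not handle block shifts), normalised by the reference length. The function and variable names are ours, not part of the tool.

```python
# Word-level edit rate as a simplified stand-in for TER
# (real TER additionally allows block shifts, which tercom computes).

def edit_rate(hyp, ref):
    """Minimum insertions/deletions/substitutions to turn hyp into ref,
    divided by the reference length."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits needed to convert h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # match / substitute
    return dp[len(h)][len(r)] / max(len(r), 1)

tm = "we want to have a table near the window"
inp = "we would like a table by the window"
# 1 deletion + 3 substitutions over an 8-token reference
print(round(edit_rate(tm, inp), 3))  # 0.5
```

The example above reproduces the alignment shown in the text (one deletion of "want" plus three substitutions), giving an edit rate of 4/8 = 0.5.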
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
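The effect of discounting deletions can be sketched as follows. The operation labels mirror the TER alignment codes used above, but the concrete weight values are illustrative assumptions, not the costs actually used in CATaLog.

```python
# Hedged sketch of the weighted edit-cost idea: score a TM match by
# summing per-operation costs from its alignment with the input,
# with deletions made much cheaper than insertions/substitutions.

def weighted_cost(ops, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """ops: list of TER-style codes: C (match), I, D, S."""
    costs = {"I": w_ins, "D": w_del, "S": w_sub, "C": 0.0}
    return sum(costs[op] for op in ops)

# "how much does it cost" against the two TM segments:
# segment 1 matches all 5 words but adds 6 extra words to delete,
seg1_ops = ["C"] * 5 + ["D"] * 6
# segment 2 matches only "how much" and needs 3 insertions.
seg2_ops = ["C"] * 2 + ["I"] * 3

# Uniform weights: segment 2 looks cheaper (3.0 < 6.0) ...
print(weighted_cost(seg1_ops), weighted_cost(seg2_ops))            # 6.0 3.0
# ... but with deletions discounted, segment 1 is preferred.
print(weighted_cost(seg1_ops, w_del=0.25), weighted_cost(seg2_ops))  # 1.5 3.0
```

With a deletion weight of 0.25 (an arbitrary choice for this sketch), segment 1 scores 1.5 against segment 2's 3.0, matching the preference argued for in the text.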
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in the selected source sentences and their corresponding translations.
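A minimal sketch of this projection step, assuming the TER-matched source positions and a GIZA++-style set of source-target index pairs are already available (the toy alignment below is invented for illustration, not taken from the tool's data):

```python
# Project "matched" status from TM source tokens to TM target tokens:
# a target token is green only if it is aligned to a source token that
# the TER alignment marked as matching the input sentence.

def color_target(matched_src_positions, src_tgt_alignment, tgt_len):
    """src_tgt_alignment: set of (src_idx, tgt_idx) pairs, 0-based."""
    colors = ["red"] * tgt_len          # unmatched by default
    for s, t in src_tgt_alignment:
        if s in matched_src_positions:
            colors[t] = "green"
    return colors

# Toy example: source tokens 0, 2, 3 matched the input sentence.
alignment = {(0, 0), (1, 5), (2, 2), (3, 4), (4, 3)}
print(color_target({0, 2, 3}, alignment, 6))
# ['green', 'red', 'green', 'red', 'green', 'red']
```

Target tokens left unaligned (like index 1 here) stay red, which is consistent with treating only explicitly aligned, matched material as safe to keep during post-editing.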
Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paideighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
5 you 're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentencesfor similarity measurement which contain one
or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
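The posting-list filter described above can be sketched as follows; the stop-word list and TM sentences below are illustrative stand-ins, not the tool's actual resources.

```python
# Minimal inverted-index sketch of the posting-list filter: each
# vocabulary word maps to the set of TM sentences containing it, and
# only those sentences are later scored against an input sentence.

from collections import defaultdict

STOPWORDS = {"the", "a", "to", "of"}   # illustrative, not the tool's list

def build_index(tm_sentences):
    index = defaultdict(set)
    for i, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOPWORDS:   # no stemming: surface forms only
                index[word].add(i)
    return index

def candidates(index, input_sentence):
    """Union of posting lists for the input's words."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["how much does it cost",
      "we want to have a table near the window",
      "you gave me the wrong change"]
idx = build_index(tm)
print(sorted(candidates(idx, "a table by the window")))   # [1]
```

Only sentence 1 shares a content word ("table", "window") with the query, so only it would be passed on to the expensive TER computation.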
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium, and Microsoft Localisation Studio, among others.
Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN-15038:2006 and ISO-9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).
3 Internal survey
We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.
We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate
3In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm
4memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html
when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although currently it is possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into a different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out. If a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue5.
Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's version of this functionality: fragment assembly6.
Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks
5For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html
6For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
In technical texts, it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving these translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.
As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parenthesis
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons
One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in (5) and (6) could appear.
(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment including only the text before the ellipsis appears, as in Example (7), the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.

(7) Installing new services...
In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon on both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
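A minimal sketch of these three steps (an illustration only; the paper does not say whether the separator stays with the first fragment, so that is assumed here):

```python
import re

# First ellipsis ("..." or the single-character form) or colon in a segment.
SEPARATOR = re.compile(r"\.\.\.|\u2026|:")

def split_on_separator(source, target):
    """Split a bi-segment on an ellipsis/colon into two new bi-segments."""
    m_src, m_tgt = SEPARATOR.search(source), SEPARATOR.search(target)
    if not m_src or not m_tgt:
        return []  # step 1: the separator must be present on both sides
    # Step 2: the part up to (and including) the separator, and the part after it.
    first = (source[:m_src.end()].strip(), target[:m_tgt.end()].strip())
    second = (source[m_src.end():].strip(), target[m_tgt.end():].strip())
    # Step 3: each fragment pair becomes a new TM segment.
    return [first, second]
```

Applied to example (2), this would yield the pairs ("The window will show the following:", "La ventana mostrará lo siguiente:") and ("The application will close", "Se cerrará la aplicación").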
4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso
The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces on both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence with those characters and the content between them removed.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
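The three fragments of step 2 can be sketched as follows (a simplified illustration assuming a single, non-nested bracketed group per sentence; several groups or nesting would need more care):

```python
import re

# One pattern per bracket type; each captures the inner content.
PATTERNS = {
    "parentheses": re.compile(r"\(([^()]*)\)"),
    "square": re.compile(r"\[([^\[\]]*)\]"),
    "curly": re.compile(r"\{([^{}]*)\}"),
}

def bracket_fragments(source, target, kind="parentheses"):
    """Return the three fragment pairs described in steps 2(a)-2(c)."""
    pattern = PATTERNS[kind]
    m_src, m_tgt = pattern.search(source), pattern.search(target)
    if not m_src or not m_tgt:
        return []  # step 1: bracketed content needed on both sides
    def without(s):
        # (a) the sentence minus the brackets and their content
        return re.sub(r"\s{2,}", " ", pattern.sub("", s, count=1)).strip()
    return [
        (without(source), without(target)),  # fragment (a)
        (m_src.group(0), m_tgt.group(0)),    # (b) brackets kept
        (m_src.group(1), m_tgt.group(1)),    # (c) content only
    ]
```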
At this preliminary stage we assumed that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the language pair used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags
Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
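This treatment can be sketched roughly as follows (assuming straight double quotes and numbered double tags of the form <1>...<1>, as in example 12; real data would need more robust tag handling):

```python
import re

TAG = re.compile(r"<[^<>]+>")               # any <...> tag occurrence
TAG_PAIR = re.compile(r"<(\d+)>(.*?)<\1>")  # paired tags like <1>text<1>
QUOTE = re.compile(r'"([^"]*)"')            # "quoted" material

def quote_tag_segments(source, target):
    """New segments kept for a bi-segment containing quotes and/or tags."""
    def stripped(s):
        # the same sentence, with quotes and tags removed
        s = QUOTE.sub(r"\1", TAG.sub("", s))
        return re.sub(r"\s{2,}", " ", s).strip()
    segments = [(stripped(source), stripped(target))]
    # the text inside each quotation / tag pair as its own segment
    for s, t in zip(QUOTE.findall(source), QUOTE.findall(target)):
        segments.append((s, t))
    for (_, s), (_, t) in zip(TAG_PAIR.findall(source), TAG_PAIR.findall(target)):
        segments.append((s, t))
    return segments
```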
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set
We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool used for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the sizes of the three TMs.
             Segments   Words (EN)   Words (ES)     Total
Project TM      16842       212472       244159    456631
Product TM      20923       274542       317797    592339
Client TM      256099      3427861      3951732   7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM
                          Segments   Words (EN)   Words (ES)     Total
Project TM new segments       6776        56973        66297    123270
Product TM new segments       7760        71034        83125    154159
Client TM new segments       74041       662714       769705   1432419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
Type              Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parenthesis             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM
              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started   234  204  150  202  143  143   38   27   26    26    26
Pre-trans     155  111  178  108  179  183  139  199  205   201   205
Fragments      36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.
           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).
Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.
Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.
Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.
Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text: Random Text errors are cases when one or both segments are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat: This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error: This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations: This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / sqrt(3.4 (ls + ld))    (1)
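Equation 1 can be implemented directly (a sketch; the paper does not state whether ls and ld are character or word counts, so character lengths are assumed here):

```python
import math

def church_gale_score(source, target):
    """Modified Church-Gale length difference between two segments."""
    ls, ld = len(source), len(target)  # assumed: character lengths
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

Segments of similar length score near 0, while a long source paired with a short target (a typical partial translation) receives a large positive score.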
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−1315.16, 0.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
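The biased sampling step can be sketched as follows (an illustration, not the authors' exact procedure: bi-segments are drawn with probability proportional to the absolute Church-Gale score, so the tails of the distribution are favoured):

```python
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    """Draw k bi-segments, favouring those far from the score center."""
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in cg_scores]  # tail scores get more mass
    return rng.choices(bisegments, weights=weights, k=k)
```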
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features
The features computed for the training and test sets are the following:
• same: This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment_similarity: the value of the feature is the cosine similarity between the target segment and the translation of the source segment into the target language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital_letters_word_difference: the value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.
• only_capletters_dif: the value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.
• lang_dif: the value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
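A minimal sketch of how such vector features can be computed. The helper names and the punctuation inventory below are our own assumptions; the paper does not specify them:

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def punctuation_vector(segment, marks=".,;:!?()\"'-"):
    """Count the punctuation marks occurring in a segment."""
    return Counter(ch for ch in segment if ch in marks)

def punctuation_similarity(source, target):
    return cosine(punctuation_vector(source), punctuation_vector(target))

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language codes (0, 1 or 2)."""
    return sum(1 for a, b in zip(declared, detected) if a != b)
```

For the lang_dif example from the text: `lang_dif(("en", "it"), ("en", "fr"))` yields 1, and `lang_dif(("en", "it"), ("de", "fr"))` yields 2.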
All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in section 3, there are cases where the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest: random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
• Logistic Regression: logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most widely used classification algorithms.
• Gaussian Naive Bayes: if the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: this algorithm classifies a new instance based on its distance to k training instances. The predicted output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
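The line-up above can be assembled directly from scikit-learn. The hyperparameters below are illustrative assumptions, since the paper does not report them:

```python
# Sketch of the classifier line-up described above (hyperparameters are
# assumptions; the paper does not report the settings it used).
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}
```

Each classifier exposes the same `fit`/`predict` interface, so the whole line-up can be trained and compared in a single loop.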
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms achieve excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
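The evaluation protocol above can be sketched with scikit-learn, whose DummyClassifier provides exact equivalents of the two baselines. The data here is a synthetic stand-in for the bi-segment feature vectors:

```python
# Three-fold stratified evaluation against the two baselines described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(90, 4)                 # synthetic stand-in for bi-segment features
y = rng.randint(0, 2, 90)           # 1 = true translation, 0 = false

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
models = [
    ("SVM", SVC(kernel="linear")),
    ("Baseline Uniform", DummyClassifier(strategy="uniform", random_state=0)),
    ("Baseline Stratified", DummyClassifier(strategy="stratified", random_state=0)),
]
for name, clf in models:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

On random features all three land near chance level; with the real features the gap reported in Table 1 opens up.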
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
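The figures quoted above can be checked directly from the confusion counts:

```python
# Recomputing precision/recall/F1 from the SVM confusion counts quoted above.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Good bi-segments that would be thrown away (the false negatives):
discarded = fn / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"share of examples lost as false negatives: {discarded:.1%}")
```

The last line gives 32/309, i.e. roughly 10%, which is the loss that makes the method unsuitable for fully automatic cleaning.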
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
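The probability-ranking idea can be sketched as follows; the data is synthetic and the 0.9/0.1 cut-offs are illustrative assumptions, not values from the paper:

```python
# Sketch of the high-precision idea: act automatically only on bi-segments
# whose Logistic Regression probability is extreme, leave the rest to a human.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.rand(200, 4)                          # synthetic feature vectors
y = (X[:, 0] + 0.2 * rng.randn(200) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]            # interpreted as P(true translation)

confident = (proba >= 0.9) | (proba <= 0.1)   # keep/delete automatically
print(f"bi-segments handled automatically: {confident.mean():.0%}")
```

Raising the cut-offs trades coverage for precision, which is exactly the trade-off proposed in the text.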
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
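A minimal sketch of the dictionary-based alternative, with toy EN-IT entries and a deliberately crude overlap score, purely for illustration:

```python
# Gloss the source segment word-by-word through a bilingual dictionary and
# compare the gloss with the target segment. Toy EN-IT entries; a real
# dictionary would be extracted with a word aligner, as the text suggests.
en_it = {"the": "il", "ministry": "ministero", "of": "di", "defense": "difesa"}

def gloss(source_tokens):
    """Replace each source token by its dictionary translation, if known."""
    return [en_it.get(tok, tok) for tok in source_tokens]

def overlap(gloss_tokens, target_tokens):
    """Crude similarity: fraction of glossed tokens found in the target."""
    target = set(target_tokens)
    return sum(1 for tok in gloss_tokens if tok in target) / len(gloss_tokens)

score = overlap(gloss("the ministry of defense".split()),
                "il ministero di difesa".split())
print(score)
```

Because the dictionary lookup is a hash access per token, this scales to the full database in a way a per-segment MT call cannot.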
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore, clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
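The effect can be illustrated with a small sketch, using Python's difflib as a stand-in for the Levenshtein-based scorers of commercial TM tools (the scores are therefore only indicative):

```python
# Fuzzy-match score of a clause against a full TM sentence vs. against
# the clauses obtained by splitting that sentence first.
from difflib import SequenceMatcher

def match_score(a, b):
    """Similarity on a 0-100 scale, as TM tools report it."""
    return 100 * SequenceMatcher(None, a, b).ratio()

tm_sentence = ("the ministry of defense once indicated that about 20000 "
               "soldiers were missing in the korean war")
tm_clauses = ["the ministry of defense once indicated",
              "that about 20000 soldiers were missing in the korean war"]
query = "the ministry of defense once indicated"

print(match_score(query, tm_sentence))                  # below typical thresholds
print(max(match_score(query, c) for c in tm_clauses))   # exact clause match
```

Against the unsplit sentence the clause scores below the 70-75% default thresholds discussed later, while against the clause-level TM it is a 100% match.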
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:

  Input:          the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:           - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
  Segments in TM:  - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
                   - and that the ministry of defense believes
                   - there may still be some survivors

Table 1: Set A Example

Without clause splitting:

  Input:          a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:  a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input:           a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:  - a member of the rap group so solid crew threw away a loaded gun during a police chase
                   - southwark crown court was told yesterday

Table 2: Set B Example
5 Results
WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   23.33                 90.00
Retrieved (minimum threshold)   38.00                 92.22

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   14.00                 88.89
Retrieved (minimum threshold)   14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
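The test itself is straightforward to reproduce. The sketch below implements a paired t-test over invented per-segment match scores (the paper used SPSS on its real scores; a 0 again stands for "correct match not retrieved"):

```python
# Paired t-test on per-segment match scores, without vs. with clause splitting.
import math
from statistics import mean, stdev

def paired_t(scores_without, scores_with):
    """Return the t statistic and degrees of freedom for paired samples."""
    diffs = [b - a for a, b in zip(scores_without, scores_with)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

without = [0, 0, 76, 0, 81, 0, 0, 90, 0, 0]      # invented scores
with_cs = [100, 85, 100, 78, 100, 92, 0, 100, 88, 95]
t, df = paired_t(without, with_cs)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting t statistic is then compared against the t distribution with n-1 degrees of freedom to obtain the p-value.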
SET A

WORDFAST             Mean w/o splitting  Mean w/ splitting  p-value
Default threshold    19.23               85.30888891        0.000
Minimum threshold    29.28               86.48888891        0.000

TRADOS               Mean w/o splitting  Mean w/ splitting  p-value
Default threshold    11.78               83.87777781        0.000
Minimum threshold    11.78               88.07111113        0.000

SET B

WORDFAST             Mean w/o splitting  Mean w/ splitting  p-value
Default threshold    2.23                17.21111111        0.000
Minimum threshold    6.49                30.62111112        0.000

TRADOS               Mean w/o splitting  Mean w/ splitting  p-value
Default threshold    2.71                23.083             0.000
Minimum threshold    16.83               46.36777779        0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause
splitting as a pre-processing step in TM match
retrieval can significantly increase matching
WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   2.67                  17.84
Retrieved (minimum threshold)   10.00                 41.08

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   3.33                  25.95
Retrieved (minimum threshold)   36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent.
The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically smaller than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy match level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human-translated hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match is identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match, and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
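Since the exact formulas are proprietary, the mechanics can only be illustrated. The sketch below is a minimal word-level variant (function names are ours, not from any CAT tool): a Levenshtein distance over tokens, normalized by the length of the longer segment. Note that it illustrates the computation rather than reproducing the exact published percentages, whose normalization is not fully specified.

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic programming over token lists:
    # minimal number of substitutions, insertions and deletions.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match_score(s1, s2):
    # One common normalization: 1 - distance / length of the longer segment.
    t1, t2 = s1.split(), s2.split()
    return 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2))

src = "Press Cancel to make the cancellation of your personal information"
new = "Press Cancel to cancel your personal information"
print(edit_distance(src.split(), new.split()))  # 4 edits: 1 substitution + 3 deletions
print(fuzzy_match_score(src, new))
```

Transforming (1) into (2) costs one substitution (make/cancel) and three deletions (the, cancellation, of), which any normalization turns into a mediocre score.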
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear either with a bare predicate noun (e.g. to make attention) or with a determined one (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
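Entries of this kind (a lemma, a category, then +-separated properties, assuming the standard NooJ comma-delimited form lemma,CAT+prop) can be read mechanically. The following is a hypothetical reader for such lines written by us for illustration, not part of NooJ itself:

```python
def parse_nooj_entry(line):
    # "artist,N+FLX=TABLE+Hum" -> lemma, category, and a property dict.
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    features = {}
    for p in props:
        key, _, value = p.partition("=")
        features[key] = value or True  # flag-only properties such as +Hum
    return {"lemma": lemma, "category": category, "features": features}

entry = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
print(entry)
# {'lemma': 'artist', 'category': 'N', 'features': {'FLX': 'TABLE', 'Hum': True}}
```

The same parser also accepts valued properties such as the translation equivalents (+IT="...") mentioned above, since it splits each property on its first "=".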
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, consists of three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected, so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
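The three pipelines can be summarized in code. The sketch below is a toy stand-in: a small hand-written pattern table replaces NooJ's dictionaries and local grammars, and all names (SVC_PATTERNS, paraphrase_svc, expand_tm) are illustrative assumptions, not the actual module:

```python
# Toy stand-in for the NooJ SVC grammar: a few SVC patterns mapped to
# full verbs. The real framework derives these from OpenLogos resources.
SVC_PATTERNS = {
    "make the cancellation of": "cancel",
    "make a cancellation of": "cancel",
}

def paraphrase_svc(segment):
    # Replace every known SVC with its full-verb paraphrase.
    for svc, verb in SVC_PATTERNS.items():
        segment = segment.replace(svc, verb)
    return segment

def expand_tm(units):
    # units: list of (source, target) pairs. The targets are never
    # touched; paraphrased sources are concatenated to the original TM.
    expanded = list(units)
    for src, tgt in units:
        para = paraphrase_svc(src)
        if para != src:
            expanded.append((para, tgt))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(len(expand_tm(tm)))  # 2: the original unit plus its paraphrased copy
```

The key property of the design survives even in this sketch: the paraphrased copy keeps the original human translation as its target, so the expanded TM introduces no machine-generated target text.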
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before when the TM contains a translation unit with the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
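The gains in the top bands follow directly from the table: each improvement is the difference in segment counts between parTM and plTM, as a share of the 200 test segments. A quick cross-check (counts copied from Table 1; the two highest bands yield the quoted 17% and roughly 5%, while the lower bands shrink as segments move upward):

```python
plTM  = {"100": 14, "95-99": 23, "85-94": 18, "75-84": 51, "50-74": 32, "No Match": 62}
parTM = {"100": 48, "95-99": 32, "85-94": 29, "75-84": 38, "50-74": 18, "No Match": 35}
total = 200  # both analyses cover the same 200 test segments

for cat in plTM:
    delta = parTM[cat] - plTM[cat]
    print(f"{cat}: {delta:+d} segments ({100 * delta / total:+.1f}% of the test set)")
# e.g. the 100% category gains 34 segments, i.e. +17.0% of the test set
```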
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear, and less post-editing is required when the match is below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable into savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities             alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
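A generated line like those in Figure 1 splits into its fields mechanically. The sketch below is a hypothetical parser written for illustration (it assumes tab-separated fields, which is our assumption, not a documented format):

```python
def parse_subsegment(line):
    # Fields: source phrase, target phrase, the four Moses probabilities,
    # and word-alignment points such as "0-0 1-1".
    src, tgt, probs, align = line.split("\t")
    probabilities = [float(p) for p in probs.split()]
    alignment = [tuple(int(i) for i in pt.split("-")) for pt in align.split()]
    return src, tgt, probabilities, alignment

line = "být větší\tbe greater\t0.538 0.053 0.538 0.136\t0-0 1-1"
src, tgt, probs, align = parse_subsegment(line)
print(tgt, probs[0], align)  # be greater 0.538 [(0, 0), (1, 1)]
```

With the alignment points in tuple form, the three cases above are easy to detect: a source position missing from all pairs is an empty alignment, one source position in several pairs is one-to-many, and decreasing target positions indicate opposite orientation.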
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

Table 1: An example of the SUBSTITUTEO method, Czech → English.
bull SUBSTITUTE new segments can be createdby replacing a part of one segment with an-other subsegment from SUB output is S
1 SUBSTITUTEO the gap in the firstsegment is covered with an overlap withthe second subsegment see the examplein Table 1 output is OS
2 SUBSTITUTEN the second subseg-ment is inserted into the gap in the firstsegment output is NS
During the subsegment non-overlapping combina-tion the acceptability of the combination is de-cided (and ordered) by measuring a language flu-ency score obtained by a combined n-gram score(for n = 〈15〉) from a target language model2
The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
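A simplified version of this greedy procedure, operating on (start, end) token spans rather than the paper's full data structures, could look like this (illustrative only; names are ours):

```python
def join_subsegments(spans):
    """Greedy JOIN over subsegment index spans, a simplified sketch.

    Each span is (start, end) of a subsegment in the tokenized input
    segment.  Two spans can be joined when they neighbour (JOINN) or
    overlap (JOINO) in the segment.  Longer spans are tried first, as
    in the paper's description.
    """
    def joinable(a, b):
        # overlap or direct adjacency, in either order
        return not (a[1] < b[0] - 1 or b[1] < a[0] - 1)

    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    produced = []
    while work:
        head, rest = work[0], work[1:]
        new = [(min(head[0], s[0]), max(head[1], s[1]))
               for s in rest if joinable(head, s)]
        new = [s for s in new if s not in produced and s not in work]
        produced.extend(new)
        # prepend the new, longer spans; the processed head is discarded
        work = sorted(new, key=lambda s: s[1] - s[0], reverse=True) + rest
    return produced
```

For example, adjacent spans (0,1) and (2,3) join into (0,3), while disjoint spans such as (0,1) and (5,6) produce nothing.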
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95–99%    0.84   0.91   0.64   0.90   1.37
85–94%    0.07   0.05   0.25   0.76   0.81
75–84%    0.80   0.91   1.71   3.78   4.40
50–74%    8.16  10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95–99%    0.75   0.44   0.49   0.66
85–94%    0.05   0.08   0.49   0.61
75–84%    0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95–99%    0.98   0.59   1.24   1.45
85–94%    0.09   0.22   1.34   1.37
75–84%    1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system3. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see the example in Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjacent segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog2, which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments, and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. The TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
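The ranking step can be sketched as follows. Since tercom is an external tool, this illustration substitutes a plain word-level Levenshtein distance (no shift operation) normalized by the test-segment length; it is a stand-in, not the tool's actual scorer:

```python
def edit_rate(test_tokens, tm_tokens):
    """Word-level edit rate: a TER-like score without the shift edit.

    Plain Levenshtein (insert/delete/substitute) between token lists,
    normalized here by the test segment's length.
    """
    m, n = len(test_tokens), len(tm_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if test_tokens[i - 1] == tm_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / max(m, 1)

def top_matches(test, tm_sources, k=5):
    """Return the k TM source segments most similar to the test segment."""
    return sorted(tm_sources,
                  key=lambda s: edit_rate(test.split(), s.split()))[:k]
```

Lower edit rate means higher similarity, so a plain ascending sort yields the top-k suggestions.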
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0]
["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0]
["window","window",C,0] [".",".",C,0]

we want  to   have  a table near the window .
|  D     S    S     |   |    S    |    |    |
we -   would like   a table  by  the window .
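Given such an alignment, the green/red fragments follow mechanically: every C token is matched, everything else is not. A small sketch (the tuple layout is simplified from tercom's actual output):

```python
def color_segments(alignment):
    """Turn a tercom-style alignment into green/red fragments.

    `alignment` is a list of (tm_word, input_word, op) tuples, where op
    is C (match), S (substitution), D (deletion) or I (insertion).
    Consecutive TM-side tokens with the same status are merged into one
    fragment, mirroring the color-coded display described in the text.
    """
    fragments = []
    for tm_word, _, op in alignment:
        color = "green" if op == "C" else "red"
        if fragments and fragments[-1][0] == color:
            # extend the current same-colored fragment
            fragments[-1] = (color, fragments[-1][1] + " " + tm_word)
        else:
            fragments.append((color, tm_word))
    return fragments
```

On the example above this yields green "we", red "want to have", green "a table", red "near", green "the window".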
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below:
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
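The preference for TM segment 1 falls out of any cost table that makes deletions nearly free. The weights below are illustrative only, not the ones used in the system:

```python
# Hypothetical edit weights: deletions nearly free, other edits cost 1.
WEIGHTS = {"C": 0.0, "D": 0.1, "S": 1.0, "I": 1.0}

def weighted_cost(ops):
    """Sum per-operation costs over a TER alignment's operation string."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost".
# TM segment 1 matches 5 words and needs 6 deletions
# ("to the holiday inn by taxi"); TM segment 2 matches 2 words and
# needs 3 insertions ("does it cost").
cost_tm1 = weighted_cost("CCCCC" + "DDDDDD")   # ~0.6
cost_tm2 = weighted_cost("CC" + "III")         # 3.0
```

With equal weights TM segment 2 would win (3 edits vs 6), but with cheap deletions TM segment 1 is preferred, matching the argument in the text.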
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment, and a red portion implies a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between the selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM:
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in both the selected source sentences and their corresponding translations.
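The projection of match information from source to target can be sketched as follows. The data layout is ours; the tool reads the same information from the TER and GIZA++ alignment files:

```python
def color_target(src_status, src2tgt, tgt_len):
    """Project source-side match/mismatch colors onto the target side.

    `src_status[i]` is True when source token i matched the input
    sentence (per the TER alignment); `src2tgt` maps a source token
    index to the set of target indices it is aligned to (from the
    word-alignment file).  A target token is green only if every source
    token aligned to it matched; unaligned target tokens stay red,
    since nothing in the input vouches for them.
    """
    tgt_green, tgt_red = set(), set()
    for i, matched in enumerate(src_status):
        for j in src2tgt.get(i, ()):
            (tgt_green if matched else tgt_red).add(j)
    return ["green" if j in tgt_green and j not in tgt_red else "red"
            for j in range(tgt_len)]
```

Note that a single matched source word aligned to several target words colors all of them green, which reflects the one-to-many case mentioned earlier.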
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). For improving the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one
or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
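The posting-list filter can be sketched in a few lines (the stop word list here is illustrative, not the one used by the tool):

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "to", "of"}   # illustrative stop list

def build_posting_lists(tm_sources):
    """Index TM source segments: word -> ids of segments containing it.

    Words are lowercased but, as in the text, not stemmed, so only
    exact surface forms match.
    """
    index = defaultdict(set)
    for seg_id, seg in enumerate(tm_sources):
        for word in set(seg.lower().split()) - STOPWORDS:
            index[word].add(seg_id)
    return index

def candidate_segments(input_sentence, index):
    """Only segments sharing at least one vocabulary word get scored."""
    candidates = set()
    for word in set(input_sentence.lower().split()) - STOPWORDS:
        candidates |= index.get(word, set())
    return candidates
```

Only the candidate set is then passed to the (expensive) TER alignment step, which is what reduces the search space.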
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
Author Index
Baisa Vít 31
Barbu Eduard 9
Chatzitheodorou Konstantinos 24
Horák Aleš 31
Medveď Marek 31
Mitkov Ruslan 17
Naskar Sudip Kumar 36
Nayek Tapas 36
Pal Santanu 36
Parra Escartín Carla 1
Raisa Timonera Katerina 17
van Genabith Josef 36
Vela Mihaela 36
Zampieri Marcos 36
for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.
Figure 1: memoQ's fragment assembly functionality.
One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
Examples 1 and 2 show two similar sentences in our TM:
(1) EN The following message then appears: Click accept to run the program
ES Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa
(2) EN The window will show the following: The application will close
ES La ventana mostrará lo siguiente: Se cerrará la aplicación
Now imagine we need to translate the sentence in 3:
(3) EN The following message then appears: The application will close
Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:
(4) ES Aparece el siguiente mensaje: Se cerrará la aplicación
In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment had been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.
As this idea seemed to have a great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.
4 The new segment generator
As explained in Section 3, we decided to test one idea that originated from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.
The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:
1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags
In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.
4.1 Ellipsis and colons
One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:
(5) EN Installing new services… Service XXXX for premium clients [2]
ES Instalando nuevos servicios… Servicio XXXX para clientes premium [2]
(6) EN You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.
(7) Installing new services
In these cases, we proceeded as follows:
1. Check that there is an ellipsis / a colon on both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
4.2 Parentheses, square brackets and curly braces
Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.
(8) EN Creates an installation package for application installation (if it was not created earlier)
ES Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)
(9) EN <return code 1>=[<description>]
ES <código de retorno 1>=[<descripción>]
(10) EN Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso
The strategy to create new segments was the following:
1. Check that there is content between parentheses / square brackets / curly braces on both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) A sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing on the closing one, and including the content within. In this fragment, the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
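A minimal sketch of the fragment extraction for one side of a bi-segment, assuming non-nested bracket pairs (the function name and regex are ours, not the authors' implementation):

```python
import re

# Paired characters handled by the sketch, mirroring Section 4.2
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]

def bracket_fragments(segment: str):
    """Return, for each bracketed span found, the three fragments of
    Section 4.2: (a) the sentence without the bracketed material,
    (b) the bracketed material including the brackets,
    (c) the bracketed material without the brackets."""
    results = []
    for open_c, close_c in BRACKETS:
        pattern = re.compile(
            re.escape(open_c)
            + r"([^" + re.escape(open_c) + re.escape(close_c) + r"]*)"
            + re.escape(close_c)
        )
        for match in pattern.finditer(segment):
            without = (segment[:match.start()] + segment[match.end():]).strip()
            without = re.sub(r"\s{2,}", " ", without)  # collapse double spaces
            results.append((without,          # (a) sentence minus the span
                            match.group(0),   # (b) span with brackets
                            match.group(1)))  # (c) span without brackets
    return results

frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)"
)
```

Applying the same call to the target side and pairing the results by position yields the new TM segments of step 3.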
At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags
Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.
(11) EN You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados
(12) EN If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
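The quotation/tag handling just described can be sketched as follows. The tag pattern models the paired numbered tags of Example 12 (<1>…<1>); the regexes and function name are our illustrative assumptions, not the authors' code.

```python
import re

TAG_RE = re.compile(r"<(\d+)>(.*?)<\1>")                    # paired <n>...<n> tags
QUOTE_RE = re.compile(r"\"([^\"]+)\"|\u201c([^\u201d]+)\u201d")  # "..." or curly quotes

def quote_tag_segments(segment: str):
    """Return the segment with quotes/tags stripped, followed by the
    inner texts, each of which becomes a new TM segment."""
    inner = []
    for m in TAG_RE.finditer(segment):
        inner.append(m.group(2))                      # text inside a tag pair
    stripped = TAG_RE.sub(lambda m: m.group(2), segment)
    for m in QUOTE_RE.finditer(segment):
        inner.append(m.group(1) or m.group(2))        # quoted text
    stripped = QUOTE_RE.sub(lambda m: m.group(1) or m.group(2), stripped)
    return [stripped] + inner

segs = quote_tag_segments(
    "If you clear the <1>Inherit settings from parent policy<1> check box"
)
```

The first returned item is the sentence without the markup; the remaining items are the inner texts kept as standalone segments.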
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set
We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 to be the CAT tool used for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ, using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
             Segments   Words EN   Words ES   Total words
Project TM      16842     212472     244159        456631
Product TM      20923     274542     317797        592339
Client TM      256099    3427861    3951732       7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.
Table 4 breaks down the number of segments generated using each strategy and for each TM used.
                          Segments   Words EN   Words ES   Total words
New Project TM segments       6776      56973      66297        123270
New Product TM segments       7760      71034      83125        154159
New Client TM segments       74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.
As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
                  TM Proj.   TM Prod.   TM Client
Ellipsis                 7          6          50
Colon                 1801       2094       17894
Parenthesis           1361       1621       29637
Square brackets         78         73         598
Curly braces             0          0           0
Quotations            1085       1146        8797
Tags                  2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact in the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started   234  204  150  202  143  143   38   27   26    26    26
Pre-trans     155  111  178  108  179  183  139  199  205   201   205
Fragments      36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%            0    5    5    5    5    5    5    5    5     5     5
95-99%          2    3    3    4    4    4    9    9    9     9     9
85-94%          0    1    1    1    1    1    2    2    2     2     2
75-84%         14    9   14    9   15   15   12   16   16    16    16
50-74%        187  130  198  136  202  202  180  223  226   226   226
No match      142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 187 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this was proven true, a productivity increase may be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).
Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.
Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.
Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.
Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations using the firstmethod is high and the errors are relatively few
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
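The yes/no framing can be made concrete with a toy baseline. This is a stand-in for the machine learning classifiers the paper evaluates; both rules (rejecting pairs with an empty side, rejecting pairs with mismatched digit content) are illustrative assumptions, not the paper's actual feature set.

```python
def classify_bisegment(source: str, target: str) -> str:
    """Answer "yes" if the pair looks like a true translation, "no" otherwise."""
    # Reject pairs where one side is empty or whitespace-only
    if not source.strip() or not target.strip():
        return "no"
    # Partial translations often drop numbers present in the source,
    # e.g. "Early 1980s Muirfield CC" -> "Primi anni" (cf. Section 3)
    src_digits = sorted(c for c in source if c.isdigit())
    tgt_digits = sorted(c for c in target if c.isdigit())
    if src_digits != tgt_digits:
        return "no"
    return "yes"
```

A real classifier would replace these hand-written rules with features learned from labelled positive and negative bi-segments.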
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan the further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumption behind these validations seems to be that the translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the length of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s − l_d) / sqrt(3.4 · (l_s + l_d))    (1)
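As a quick illustration, the score in equation 1 can be computed as follows (a minimal sketch; the function and variable names are ours):

```python
import math

def church_gale_score(ls, ld):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    the source/target length difference, normalised by the square
    root of 3.4 times the combined length."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Equal-length segments score near 0; a long source paired with a
# short target gets a large positive score, flagging a possible error.
church_gale_score(120, 118)  # close to 0
church_gale_score(120, 15)   # large positive value
```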
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set, the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set, the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using quantile-quantile (Q-Q) plots in figures 5 and 6.
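The computation behind a Q-Q plot can be sketched in a few lines (our own illustration, not the plotting code used in the paper): sort the scores, standardise them, and pair each sample quantile with the corresponding quantile of the standard normal.

```python
from statistics import NormalDist

def qq_points(scores):
    """Pairs (theoretical normal quantile, standardised sample quantile),
    the points plotted in a quantile-quantile plot."""
    xs = sorted(scores)
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    nd = NormalDist()
    # (i + 0.5) / n are the plotting positions for the n order statistics.
    return [(nd.inv_cdf((i + 0.5) / n), (x - mean) / sd)
            for i, x in enumerate(xs)]
```

If the scores are normally distributed, the returned points fall close to the line y = x.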
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:
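The tail-biased draw from the Collaborative Set could be approximated as follows (a sketch under our own assumption that sampling weights are proportional to the absolute Church-Gale score; the paper does not specify the weighting):

```python
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    """Draw k bi-segments, favouring those with extreme Church-Gale
    scores, i.e. bi-segments far from the center of the distribution."""
    rng = random.Random(seed)
    # Small epsilon keeps zero-score bi-segments drawable.
    weights = [abs(s) + 1e-6 for s in cg_scores]
    return rng.choices(bisegments, weights=weights, k=k)
```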
• Training Set: It contains 1243 bi-segments and has 373 negative examples.
• Test Set: It contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example, when the acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segments contain a URL address; otherwise it is 0.
• has tag: The value of the feature is 1 if the source or target segments contain a tag; otherwise it is 0.
• has email: The value of the feature is 1 if the source or target segments contain an email address; otherwise it is 0.
• has number: The value of the feature is 1 if the source or target segments contain a number; otherwise it is 0.
• has capital letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter; otherwise it is 0.
• has words capital letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).
• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.
• url similarity: The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments when this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than the Decision Tree.
• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among these instances. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
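The comparison of these six classifiers can be reproduced with scikit-learn along the following lines (a sketch on synthetic data; the paper's actual feature matrix is the one described in section 4.1):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (roughly 30% negatives).
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + 0.2 * rng.randn(300) > 0.3).astype(int)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Three-fold stratified cross-validation, as in the first evaluation.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
f1 = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
      for name, clf in classifiers.items()}
```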
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest            0.95      0.97   0.96
Decision Tree            0.98      0.97   0.97
SVM                      0.94      0.98   0.96
K-Nearest Neighbors      0.94      0.98   0.96
Logistic Regression      0.92      0.98   0.95
Gaussian Naive Bayes     0.86      0.96   0.91
Baseline Uniform         0.69      0.53   0.60
Baseline Stratified      0.70      0.73   0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
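The two baselines correspond to scikit-learn's DummyClassifier strategies of the same names (a sketch with synthetic labels mirroring the ~30% negative class distribution, not the paper's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
# Roughly 70% positive / 30% negative, as in the training and test sets.
y_train = rng.binomial(1, 0.7, 1243)
y_test = rng.binomial(1, 0.7, 309)
# A dummy baseline ignores the features, so placeholders suffice.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

for strategy in ("uniform", "stratified"):
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_train, y_train)
    score = f1_score(y_test, baseline.predict(X_test))
    print(f"Baseline {strategy}: F1 = {score:.2f}")
```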
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest            0.85      0.63   0.72
Decision Tree            0.82      0.69   0.75
SVM                      0.82      0.81   0.81
K-Nearest Neighbors      0.83      0.66   0.74
Logistic Regression      0.80      0.80   0.80
Gaussian Naive Bayes     0.76      0.61   0.68
Baseline Uniform         0.71      0.72   0.71
Baseline Stratified      0.70      0.51   0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
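The arithmetic behind that observation follows directly from the reported confusion counts:

```python
# Confusion counts reported for SVM on the test set.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn  # 309 examples

# Share of the test set that would be wrongly discarded: the false
# negatives are good bi-segments the classifier would delete.
discarded = fn / total
print(f"{discarded:.1%} of all examples")  # about 10%, as noted above
```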
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. This distribution is closer to the normal distribution for the Matecat Set and sparser for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
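The effect can be illustrated with a word-level Levenshtein fuzzy score (a simplified stand-in for commercial TM scoring, which varies between tools):

```python
def levenshtein(a, b):
    """Word-level edit distance via the standard dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def match_score(src, tm):
    """Fuzzy match as 1 minus the normalised edit distance."""
    return 1 - levenshtein(src, tm) / max(len(src), len(tm))

a = [f"w{i}" for i in range(1, 21)]                                    # input (a): w1..w20
b = [f"w{i}" for i in range(1, 6)] + [f"w{i}" for i in range(21, 36)]  # TM (b): w1..w5 w21..w35

match_score(a, b)          # 0.25: below any usual fuzzy threshold
match_score(a[:5], b[:5])  # 1.0: exact once the shared clause is its own segment
```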
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
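To make the pre-processing step concrete, here is a deliberately naive, marker-based splitter. This is NOT Puscasu's (2004) rule set, which relies on finite-verb identification; it only illustrates the idea of turning one sentence into several clause-like TM segments:

```python
import re

# Assumed, toy list of clause-introducing markers.
MARKERS = r"\b(and that|that|which|who|because|although|when|while)\b"

def naive_clause_split(sentence):
    parts = re.split(MARKERS, sentence)
    # re.split keeps the captured markers; reattach each one to the chunk it opens.
    clauses, cur = [], parts[0].strip()
    for i in range(1, len(parts), 2):
        if cur:
            clauses.append(cur)
        cur = (parts[i] + " " + parts[i + 1].strip()).strip()
    if cur:
        clauses.append(cur)
    return clauses

naive_clause_split(
    "the ministry of defense once indicated that about 20000 soldiers "
    "were missing in the korean war")
# ['the ministry of defense once indicated',
#  'that about 20000 soldiers were missing in the korean war']
```

Each returned chunk would then be stored as a separate segment in the TM, as in the Set A example of Table 1.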
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the component clauses of the original longer segment are stored in the TM; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example.

Without clause splitting:
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example.
5 Results
Wordfast                        w/o clause splitting   w/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

Trados                          w/o clause splitting   w/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4: Percentage of correctly retrieved segments in Set B.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and count it as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
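The test statistic itself can be reproduced with a short pure-Python sketch (the paper used SPSS; the scores below are invented toy values, not the experimental data).

```python
from math import sqrt
from statistics import mean, stdev


def segment_score(clause_scores):
    """A multi-clause input contributes the average of its clause scores as one case."""
    return mean(clause_scores)


def paired_t_statistic(without_split, with_split):
    """t statistic of a paired t-test over per-segment match scores."""
    diffs = [b - a for a, b in zip(without_split, with_split)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Toy example: 0 marks segments where no correct match was retrieved.
without = [0, 10, 20, 0]
with_ = [90, 85, 95, 80]
t = paired_t_statistic(without, with_)
```

The p-value is then read off the t distribution with n-1 degrees of freedom, which is what a statistics package does internally.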
SET A

WORDFAST            Without clause splitting   With clause splitting   p-value
Default threshold           19.23                  85.30888891          0.000
Minimum threshold           29.28                  86.48888891          0.000

TRADOS              Without clause splitting   With clause splitting   p-value
Default threshold           11.78                  83.87777781          0.000
Minimum threshold           11.78                  88.07111113          0.000

SET B

WORDFAST            Without clause splitting   With clause splitting   p-value
Default threshold            2.23                  17.21111111          0.000
Minimum threshold            6.49                  30.62111112          0.000

TRADOS              Without clause splitting   With clause splitting   p-value
Default threshold            2.71                  23.083               0.000
Minimum threshold           16.83                  46.36777779          0.000

Table 5: Paired t-test on all experiments (mean match scores)
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          2.67%                17.84%
Retrieved (minimum threshold)         10.00%                41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          3.33%                25.95%
Retrieved (minimum threshold)         36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin/Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand of professional translation services
has been increased over the last few years and it is
forecast to continue to grow for the foreseeable
future Researchers to support this increasing
have been proposed and implemented new com-
puter-based tools and methodologies that assist
the translation process The idea behind the com-
puter-assisted software is that a translator should
benefit as much as possible from reusing transla-
tions that have been human translated in the past
The first thoughts can be traced back to the 1960s
when the European Coal and Steel Community
proposed the development of a memory system
that retrieves terms and their equivalent contexts
from earlier translations stored in its memory by
the sentences whose lexical items are close to the
lexical items of the sentence to be translated (Kay
1980)
Since then TM systems have become indispen-
sable tools for professional translators who work
mostly with content that is highly repetitive such
as technical documentation games and software
localization etc TM systems typically exploit not
only exact matches between segments from the
document to be translated with segments from
previous translations but also approximate
matches (often referred to as fuzzy matches)
(Biccedilici and Dymetman 2008) As concept this
technique might be more useful for a translator be-
cause all the previous human translations become
a starting point of the new translation Further-
more the whole process is speeded up and the
translation quality is more consistent and effi-
cient
The fuzzy match level refers to all the neces-
sary corrections made by a professional transla-
tion in order to make the retrieved suggestion to
meet all the standards of the translation process
This effort is typically less than translating the
sentence from scratch To help the translator
CAT tools suggest or highlight all the differences
or similarities between the sentences penaltizing
as well the match percent in some cases However
given the perplexity of a natural language for
similar but not identical sentences the fuzzy
matching level sometimes is too low and therefore
the translator is confused
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work to improve the transla-
tion memory matching and retrieval of translation
units when working with CAT tools (Koehn and
Senellart 2010 He at al 2010a Zhechev and van
Genabith 2010 Wang et al 2013) Such works
aim to improve the machine translation (MT) con-
fidence measures to better predict the human ef-
fort in order to obtain a quality estimation that has
the potential to replace the fuzzy match score in
the TM In addition to this these techniques have
an effect only in improvement of the MT raw out-
put and not in improvement of fuzzy matching
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters), without counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
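A word-level score in the spirit of Koehn's (2010) formula can be sketched as follows. Note that the exact normalization used by commercial tools is not disclosed, and the paper's own word counts differ slightly from this sketch, which normalizes the edit distance by the length of the longer segment.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (a[i - 1] != b[j - 1]))       # substitution/match
            prev = cur
    return dp[-1]


def word_fuzzy_match(s, t):
    """1 - word edit distance / length of the longer segment."""
    ws, wt = s.split(), t.split()
    return 1 - edit_distance(ws, wt) / max(len(ws), len(wt))


s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
# 1 substitution and 3 deletions separate the two sentences at word level
dist = edit_distance(s1.split(), s2.split())
score = word_fuzzy_match(s1, s2)
```

For sentences (1) and (2) this yields a word edit distance of 4, i.e. well below the level at which most tools would offer the match.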
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, the main meaning residing with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Besides fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part of speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
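For illustration, the structure of such entries can be decoded with a small parser. This is a sketch only: it assumes the common NooJ convention of a comma between lemma and category and "+"-separated properties, and does not cover the full .dic syntax.

```python
def parse_nooj_entry(line):
    """Split a NooJ-style entry into lemma, category, valued and flag properties."""
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    valued, flags = {}, set()
    for p in props:
        if "=" in p:
            key, value = p.split("=", 1)
            valued[key] = value     # e.g. FLX=TABLE (inflectional paradigm)
        else:
            flags.add(p)            # e.g. Hum (semantic class)
    return lemma, category, valued, flags


entry = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
```

Applied to the first entry of Figure 1, this yields the lemma "artist", category N, the inflectional paradigm TABLE, and the semantic flag Hum.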
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process; tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
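As a rough illustration of what the local grammar does, the following sketch paraphrases a few present-tense SVCs with a hand-made regular expression and nominalization table. NooJ's grammars additionally preserve tense, person and mood, which this toy version ignores, and the word lists below are invented for the example.

```python
import re

# Invented toy lexicon: predicate noun -> full verb
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"

# support verb (+ optional determiner) + noun (+ optional 'of')
SVC = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:(?:the|a|an)\s+)?(\w+)(?:\s+of)?\b",
    re.IGNORECASE)


def paraphrase_svc(sentence):
    """Replace 'support verb (+ det) + nominalization (+ of)' by the full verb."""
    def repl(match):
        noun = match.group(1).lower()
        # leave non-SVC verb-noun combinations untouched
        return NOMINALIZATIONS.get(noun, match.group(0))
    return SVC.sub(repl, sentence)


para = paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information")
```

On sentence (1) this produces exactly sentence (2), which is the source-side rewrite that makes the later exact match possible.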
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units. This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
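The overall expansion step can be sketched over a TMX-like structure. This is a minimal sketch, not the framework itself: it assumes the source tuv comes first in each tu, takes any paraphrase function as an argument, and simply deep-copies translation units so that proprietary tags and the target side are carried over untouched.

```python
import copy
import xml.etree.ElementTree as ET


def expand_tm(tmx_xml, paraphrase):
    """Append a paraphrased copy of each tu whose source can be paraphrased.

    The target side is left untouched; only the source segment changes."""
    root = ET.fromstring(tmx_xml)
    body = root.find("body")
    clones = []
    for tu in body.findall("tu"):
        src_seg = tu.findall("tuv")[0].find("seg")  # assumes source tuv first
        new_text = paraphrase(src_seg.text)
        if new_text != src_seg.text:
            clone = copy.deepcopy(tu)               # keeps target and all tags
            clone.findall("tuv")[0].find("seg").text = new_text
            clones.append(clone)
    body.extend(clones)
    return ET.tostring(root, encoding="unicode")


tmx = """<tmx><body>
<tu><tuv xml:lang="en"><seg>You must make the cancellation</seg></tuv>
    <tuv xml:lang="it"><seg>Si deve cancellare</seg></tuv></tu>
</body></tmx>"""
expanded = expand_tm(tmx, lambda s: s.replace("make the cancellation", "cancel"))
```

The expanded TM keeps the original unit and gains a second unit whose source is the paraphrase and whose Italian target is identical, which is exactly what the CAT tool then indexes for matching.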
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.
The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve extremely high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category and finally 27% in the 0%-74% category (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14      48
95%-99%                   23      32
85%-94%                   18      29
75%-84%                   51      38
50%-74%                   32      18
No match                  62      35
Total                    200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used incomputer-aided translation (CAT) systemsare the highest-quality source of paralleltexts since they consist of segment trans-lation pairs approved by professional hu-man translators The obvious problem istheir size and coverage of new documentsegments when compared with other par-allel data
In this paper we describe several methodsfor expanding translation memories usinglinguistically motivated segment combin-ing approaches concentrated on preserv-ing the high translational quality Theevaluation of the methods was done on amedium-size real-world translation mem-ory and documents provided by a Czechtranslation company as well as on a largepublicly available DGT translation mem-ory published by European CommissionThe asset of the TM expansion methodswere evaluated by the pre-translation anal-ysis of widely used MemoQ CAT systemand the METEOR metric was used formeasuring the quality of fully expandednew translation segments
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. The coverage of TMs translates directly into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
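As an illustration, these three cases can be detected mechanically from the alignment points. The following sketch (the function name and returned labels are our own, not part of the described system) classifies a Moses-style alignment string:

```python
def classify_alignment(alignment, src_len, tgt_len):
    """Classify a Moses-style alignment string such as '0-0 1-1'."""
    pairs = [tuple(map(int, p.split("-"))) for p in alignment.split()]
    aligned_src = {i for i, _ in pairs}
    aligned_tgt = {j for _, j in pairs}
    props = set()
    # 1) empty alignment: some source or target word has no counterpart
    if len(aligned_src) < src_len or len(aligned_tgt) < tgt_len:
        props.add("empty")
    # 2) one-to-many: one word is linked to several words on the other side
    if len(pairs) > len(aligned_src) or len(pairs) > len(aligned_tgt):
        props.add("one-to-many")
    # 3) opposite orientation: two alignment links cross each other
    for (i1, j1) in pairs:
        for (i2, j2) in pairs:
            if i1 < i2 and j1 > j2:
                props.add("reordered")
    return props

print(classify_alignment("0-0 1-1", 2, 2))   # monotone and complete: set()
print(classify_alignment("0-1 1-0", 2, 2))   # {'reordered'}
print(classify_alignment("0-0 0-1", 1, 2))   # {'one-to-many'}
```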
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
32
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
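To make the overlap variant concrete, here is a minimal sketch (our own illustration, with hypothetical helper names) of merging two known subsegments that overlap inside an input segment, as in the JOINO and SUBSTITUTEO cases. Spans are (start, end) token indexes in the input segment, end exclusive:

```python
def join_overlap(seg_tokens, a, b):
    """Merge two subsegment spans that properly overlap in the segment."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 < e1 and e2 > e1:          # proper overlap inside the segment
        return (s1, e2), seg_tokens[s1:e2]
    return None                      # no overlap: this variant does not apply

seg = "can be divided into the following categories".split()
# suppose "can be divided into" and "into the following categories"
# are known subsegments with known translations
print(join_overlap(seg, (0, 4), (3, 7)))
# ((0, 7), ['can', 'be', 'divided', 'into', 'the', 'following', 'categories'])
```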
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
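The fluency ranking can be pictured with a toy n-gram model. The paper uses a KenLM model trained on enTenTen; the bigram counter below is only an illustrative stand-in:

```python
from collections import Counter

# Toy corpus standing in for the large target-language data;
# a real system would query a trained KenLM model instead.
corpus = ["a table by the window", "we would like a table"]
bigrams = Counter(b for s in corpus
                  for b in zip(s.split(), s.split()[1:]))

def fluency(tokens):
    """Combined n-gram score, reduced to bigram hits for illustration."""
    return sum(bigrams.get(b, 0) for b in zip(tokens, tokens[1:]))

good = "a table by the window".split()
bad = "table a window the by".split()
# the fluent ordering scores higher and would be kept by the filter
```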
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
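The loop just described can be sketched roughly as follows, with subsegments represented by their (start, end) token spans; `joinable` and `merge` stand in for the actual JOINO/JOINN tests, and this is our loose reconstruction, not the authors' code:

```python
def joinable(a, b):
    # two spans can be joined if they overlap or neighbour in the segment
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1

def merge(a, b):
    (s1, e1), (s2, e2) = sorted([a, b])
    return (s1, max(e1, e2))

def greedy_join(spans):
    """Greedily join the longest span with joinable ones: successful joins
    go to a temporary list T, which is prepended before the next round."""
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while I:
        head, rest = I[0], I[1:]
        T = [merge(head, s) for s in rest if joinable(head, s)]
        rest = [s for s in rest if not joinable(head, s)]
        if T:    # restart with the newly built, longer subsegments
            I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + rest
        else:    # nothing joinable: keep head and discard it from processing
            result.append(head)
            I = rest
    return result

print(greedy_join([(0, 3), (2, 5), (6, 8)]))   # [(0, 5), (6, 8)]
```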
²In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For a comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
33
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

³http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English.

source seg.     Oblast dat může mít libovolný tvar
reference       The data area may have an arbitrary shape
generated seg.  Area data may have any shape
34
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjacent segments. The techniques include linguistically motivated filters for removing phrases which would provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³

¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,² which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

¹www.matecat.com
²The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase the translator's productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations.⁴ Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

³The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert a sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
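The effect of down-weighting deletions can be reproduced with a small weighted edit distance. The cost values below are illustrative, not the ones used in the actual system, and TER's shift operation is omitted for brevity:

```python
def weighted_edit(src, ref, ins=1.0, dele=0.1, sub=1.0):
    """Cost of turning TM segment `ref` into test segment `src`
    with a cheap deletion operation (shifts not modelled)."""
    m, n = len(src), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * ins
    for j in range(1, n + 1):
        d[0][j] = j * dele
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + ins,          # insert src word
                          d[i][j - 1] + dele,          # delete ref word
                          d[i - 1][j - 1]
                          + (0 if src[i - 1] == ref[j - 1] else sub))
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# cheap deletions prefer the longer match TM segment 1
assert weighted_edit(test, tm1) < weighted_edit(test, tm2)
```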
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions imply matched fragments and red portions imply a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই।

• Alignment: NUL () we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
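The bracketed alignment line above can be parsed into a word-to-target-position mapping with a few lines of code. This parser is our own illustration, assumes the exact format shown, and keys by surface word; repeated source words would need positional keys in a real implementation:

```python
import re

line = ("NUL () we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) "
        "table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )")

def parse_giza(line):
    """Map each source word to its list of aligned target positions."""
    align = {}
    for word, targets in re.findall(r"(\S+) \(([\d ]*)\)", line):
        align[word] = [int(t) for t in targets.split()]
    return align

a = parse_giza(line)
print(a["we"])   # [1]  (aligned to the first Bengali word)
print(a["to"])   # []   (empty alignment)
```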
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency
Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., the source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
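The posting-list filter described above can be sketched as follows (a minimal illustration; the stop-word list and function names are assumptions, not the tool's actual code):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of"}   # illustrative stop-word list

def build_posting_lists(tm_sources):
    """Map each (lowercased, unstemmed) content word to the ids of the
    TM source sentences containing it. Built offline, once."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidates(input_sentence, postings):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the TER-based similarity measurement."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids
```

Only the candidate set, rather than the whole TM, is then scored with TER, which is where the speed-up comes from.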
3.4 Batch Translation
The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example, sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements
We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6345244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
4.1 Ellipsis and colons
One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in Examples 5 and 6 could appear:
(5) EN: Installing new services… Service XXXX for premium clients [2]
    ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]
(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
    ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f
If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.
(7) Installing new services
In these cases, we proceeded as follows:
1. Check that there is an ellipsis or a colon on both the source and the target segment.
2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.
3. Create a new TM segment for each fragment.
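The three steps above can be sketched as a small function (an illustrative reconstruction, not the authors' script; the function name is assumed):

```python
def split_on_ellipsis_colon(src, tgt):
    """Split a bi-segment at the first ellipsis or colon present on BOTH
    sides (step 1), returning the two new bi-segments (steps 2 and 3)."""
    for mark in ("...", "\u2026", ":"):          # three-dot and one-char ellipsis
        if mark in src and mark in tgt:
            s_head, s_tail = src.split(mark, 1)
            t_head, t_tail = tgt.split(mark, 1)
            return [(s_head.strip(), t_head.strip()),
                    (s_tail.strip(), t_tail.strip())]
    return []   # mark missing on one side: generate no new segments
```

Applied to Example 5, this would yield "Installing new services" / "Instalando nuevos servicios" as a standalone bi-segment, which is exactly the fuzzy match needed for Example 7.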
4.2 Parentheses, square brackets and curly braces
Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store both the translation of the fragment between the aforementioned characters and the sentence without such content.
(8) EN: Creates an installation package for application installation (if it was not created earlier)
    ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)
(9) EN: <return code 1>=[<description>]
    ES: <código de retorno 1>=[<descripción>]
(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
     ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso
The strategy to create new segments was the following:
1. Check that there is content between parentheses, square brackets or curly braces on both the source and the target segment.
2. Keep three fragments out of each sentence:
(a) A sentence removing those characters and the content between them.
(b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment, the parentheses, square brackets or curly braces are maintained.
(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.
3. Create a new TM segment for each fragment.
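Step 2 can be sketched as follows for a single sentence (an illustrative reconstruction under the stated assumptions, handling only the first bracketed span per sentence; the real script presumably handles several):

```python
import re

BRACKETS = {"(": ")", "[": "]", "{": "}"}

def bracket_fragments(sentence):
    """For the first bracketed span found, return the three kept fragments:
    (a) the sentence without the span, (b) the span with its brackets,
    (c) the span content without the brackets. None if no span is found."""
    for open_b, close_b in BRACKETS.items():
        pattern = (re.escape(open_b) + r'([^' + re.escape(close_b) + r']*)'
                   + re.escape(close_b))
        m = re.search(pattern, sentence)
        if m:
            without = (sentence[:m.start()] + sentence[m.end():]).strip()
            return without, m.group(0), m.group(1)
    return None
```

On the English side of Example 8 this yields the bare sentence, "(if it was not created earlier)", and "if it was not created earlier" as the three new source fragments.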
At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
4.3 Quotations and text within tags
Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
     ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.
(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
     ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
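The handling of quotations and double tags can be sketched as follows. This is an illustrative reconstruction: it assumes numbered placeholder tags come in open/close pairs of the form <1>…</1> (common in CAT tools), and the function name is made up.

```python
import re

TAG_INNER_RE = re.compile(r'<(\d+)>(.*?)</\1>')   # assumed <1>...</1> tag pairs
QUOTE_INNER_RE = re.compile(r'"([^"]+)"')

def quote_tag_segments(sentence):
    """Return (1) the sentence with quotes/tags removed but their inner text
    kept in place, and (2) each quoted or tagged inner text as a new segment."""
    inner = QUOTE_INNER_RE.findall(sentence)
    inner += [m.group(2) for m in TAG_INNER_RE.finditer(sentence)]
    stripped = TAG_INNER_RE.sub(lambda m: m.group(2), sentence)
    stripped = QUOTE_INNER_RE.sub(lambda m: m.group(1), stripped)
    return stripped, inner
```

On Example 12 this keeps the whole sentence without tags as one new segment and extracts "Inherit settings from parent policy", "Settings inheritance" and "General" as additional segments.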
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set
We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool used for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.
              Segments   Words EN   Words ES     Total
Project TM       16842     212472     244159    456631
Product TM       20923     274542     317797    592339
Client TM       256099    3427861    3951732   7379593

Table 2: Size of the different TMs used
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.
Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM
                  Segments   Words EN   Words ES     Total
New Project TM        6776      56973      66297    123270
New Product TM        7760      71034      83125    154159
New Client TM        74041     662714     769705   1432419

Table 3: Size of the new TMs generated using our approach
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment-generating category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
                 Project TM   Product TM   Client TM
Ellipsis                  7            6          50
Colon                  1801         2094       17894
Parenthesis            1361         1621       29637
Square brackets          78           73         598
Curly braces              0            0           0
Quotations             1085         1146        8797
Tags                   2523         2892       17792

Table 4: Number of newly generated segments per type and TM used
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new-segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM
              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started   234  204  150  202  143  143   38   27   26    26    26
Pre-trans     155  111  178  108  179  183  139  199  205   201   205
Fragments      36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started
           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band
generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase may be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach, and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments, and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination ofalready existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms in finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained using the first method is high, and the errors are relatively few.
¹https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
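The setup can be sketched as follows. This is purely illustrative: the paper's actual features and classifiers are described later; here a minimal nearest-centroid binary classifier stands in for them, over hypothetical per-bi-segment features such as a source/target length ratio.

```python
# Minimal illustrative stand-in for the machine-learning algorithms tested
# in the paper: a nearest-centroid binary classifier over hypothetical
# bi-segment features [length ratio, shared-token ratio].
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def classify(x, pos_c, neg_c):
    """Return True ('true translation') if x is closer to the positive centroid."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return dist(x, pos_c) <= dist(x, neg_c)

# Hypothetical training examples (not the paper's data)
pos_examples = [[0.95, 0.60], [1.05, 0.70]]   # true translations: balanced sides
neg_examples = [[0.20, 0.05], [0.30, 0.00]]   # false translations, e.g. partial
pos_c, neg_c = centroid(pos_examples), centroid(neg_examples)
```

A new bi-segment's feature vector is then labeled yes/no by whichever centroid it falls nearer to; the paper's point is that realistic classifiers of this general shape, trained on Church-Gale-derived examples, are accurate but not yet accurate enough for fully automatic cleaning.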
The rest of the paper has the following structure: Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw conclusions and outline further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench². Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g., tags not closed), or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memory part of MyMemory contains. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates into Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. Contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases where a wrong language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French) and the aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when contributors translate only part of the source segment. For example, the English source segment "Early 1980s Muirfield C.C." is only partially translated into Italian as "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples because the vast majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (1993): in a parallel corpus, the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls - ld) / sqrt(3.4 (ls + ld))        (1)
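As an illustration, the score in equation 1 can be computed from character lengths alone. The sketch below is our own minimal Python rendering of the formula, not the system's actual code; the function name is illustrative.

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011).

    ls and ld are the character lengths of the source and destination
    segments; scores far from 0 indicate suspicious length mismatches.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# The partial translation from error type 4 yields a clearly positive
# score, while an exact copy scores 0:
print(round(church_gale_score("Early 1980s Muirfield C.C.", "Primi anni 1980"), 2))
print(church_gale_score("abc", "abc"))
```

Note that true translations also produce small non-zero scores, which is why the classifiers, rather than a fixed cut-off, are left to learn a useful threshold.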
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and contain few errors. The other set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should differ. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right, but close to normal. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of scores away from the center is much lower than that of scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set.
Figure 3: The distribution of Church-Gale scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [-131.51, 60.15].
To see how close the distributions of Church-Gale scores are to a normal distribution, we have plotted them against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments whose scores lie away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: contains 1243 bi-segments, of which 373 are negative examples.
• Test Set: contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
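The tail-biased sampling step can be sketched as follows. Weighting candidates by the absolute Church-Gale score is one simple way to favour the tails of the distribution; it is our illustrative assumption rather than the exact procedure used, and the drawn candidates would still be manually validated as described above.

```python
import math
import random

def church_gale(ls: int, ld: int) -> float:
    # Modified Church-Gale length difference (equation 1)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

def sample_biased(bisegments, k, seed=0):
    """Draw k bi-segments (with replacement, for simplicity), weighting
    each by |CG score| so that candidates far from the centre of the
    distribution, which are more likely to be errors, are favoured."""
    rng = random.Random(seed)
    weights = [abs(church_gale(len(s), len(t))) + 1e-6 for s, t in bisegments]
    return rng.choices(bisegments, weights=weights, k=k)

pool = [("hello world", "ciao mondo"),
        ("how are you", "bene"),            # chat-style error, skewed lengths
        ("the cat sleeps", "il gatto dorme")]
candidates = sample_biased(pool, k=2)
print(len(candidates))
```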
Figure 5: The Q-Q plot for the Matecat Set.
Figure 6: The Q-Q plot for the Collaborative Set.
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, where the source segment is copied into the target segment. Of course there are perfectly legitimate cases where the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases where a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.
• has capital letters: The value of the feature is 1 if the source or target segment contains words with at least one capital letter, otherwise it is 0.
• has words capital letters: The value of the feature is 1 if the source or target segment contains words consisting entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. a tag exists/does not exist, and if it exists it is present/not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capitalized words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of all-capital words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases where the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least in theory, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the classes are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with a linear kernel: Support Vector Machines are among the most widely used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence assumption that the naive Bayes class of algorithms postulates holds, training converges faster than for logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The predicted label is the majority label among these neighbors. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problemswhere the decision boundary is irregular
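A minimal sketch of how these six classifiers could be instantiated and compared in scikit-learn, using synthetic stand-in data (the real features and labels are those described in sections 3 and 4.1, not the random matrix below). The hyperparameters shown are library defaults, not necessarily the settings used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: feature values normalized to [0, 1] as in section 4.1
rng = np.random.default_rng(0)
X = rng.random((300, 12))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy decision rule

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Three-fold stratified evaluation, mirroring the setup in section 5
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.2f}")
```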
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms achieve excellent results. All algorithms beat the baselines by a significant margin (at least 20 percentage points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 percentage points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of bi-segments in the Matecat and Web Sets; obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with a linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased, even if the recall drops. We make some suggestions in the next section.
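Recomputing the metrics from these confusion counts is a useful sanity check; the values land close to, though not exactly on, the SVM row of table 2, presumably because of rounding in the reported figures.

```python
# Confusion counts for the SVM on the 309-example test set, as reported above
tp, fp, fn, tn = 175, 42, 32, 60
assert tp + fp + fn + tn == 309

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of the whole test set that consists of false negatives:
# good bi-segments that the cleaner would wrongly discard
fn_share = fn / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.81 0.85 0.83
print(round(fn_share, 2))  # 0.1
```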
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to normal for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with a linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the
performance of finding the true negative examplesdoes not offset the cost of deleting the false nega-tives from the database
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on the subset of bi-segments for which the classification precision exceeds 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well for an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, Sept 2015.
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches: if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of match, such matches are less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of a match being found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context, and to actually be used by the translator, than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, allowing sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could obtain a match score sufficiently high to be retrieved by the TM system.
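This effect can be illustrated with a word-level Levenshtein fuzzy score, a common approximation of how TM tools compute match percentages (the actual scoring formulas of commercial tools are proprietary, so this sketch is an assumption):

```python
def levenshtein(a, b):
    """Word-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(query, candidate):
    """Match percentage: 100 means identical, 0 means nothing shared."""
    q, c = query.split(), candidate.split()
    return 100 * (1 - levenshtein(q, c) / max(len(q), len(c)))

# Sentences (a) and (b) above: a 20-word input sharing a 5-word prefix
# with a 20-word TM segment
inp = " ".join(f"w{i}" for i in range(1, 21))
tm = " ".join(f"w{i}" for i in list(range(1, 6)) + list(range(21, 36)))
print(round(fuzzy_score(inp, tm)))    # 25: far below typical thresholds

# With clause splitting, the shared clause is stored as its own segment
clause = " ".join(f"w{i}" for i in range(1, 6))
tm_clause = " ".join(f"w{i}" for i in range(1, 6))
print(round(fuzzy_score(clause, tm_clause)))  # 100: exact clause match
```

The 25% score for the whole-sentence comparison matches the figure given in the example above, while the split clause scores a perfect match.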
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm of most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.
Without clause splitting
  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
  Inputs:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example

Without clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the clause splitter failing to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. Although the percentages of retrieved matches are generally lower than those in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all when it has more than one clause, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat it as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
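The paired test statistic is computed from the per-segment difference scores; the sketch below is an illustrative pure-Python version with made-up scores, not the SPSS computation used in the study.

```python
# Sketch of a paired t-test on per-segment match scores (illustrative only;
# the study used SPSS). The t statistic is the mean of the paired
# differences divided by its standard error.
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t statistic for paired samples, computed from difference scores."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# hypothetical match scores without / with clause splitting (0 = not retrieved)
without_split = [0, 20, 0, 30, 10]
with_split = [70, 80, 80, 80, 75]
t = paired_t(without_split, with_split)
```

With these illustrative scores the statistic is t = 13.0 on 4 degrees of freedom; the p-value would then be read off the t distribution, which is what the SPSS output in Table 5 reports.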
SET A

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold        19.23              85.30888891       0.000
Minimum threshold        29.28              86.48888891       0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold        11.78              83.87777781       0.000
Minimum threshold        11.78              88.07111113       0.000

SET B

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold         2.23              17.21111111       0.000
Minimum threshold         6.49              30.62111112       0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold         2.71              23.083            0.000
Minimum threshold        16.83              46.36777779       0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology supporting and facilitating the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) in all source translation units to expand the fuzzy matching capabilities when searching the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman 2008). Conceptually, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation is more consistent and efficient.
The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than that of translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 outlines plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no sufficiently close match is found in the TM, the sentence is machine-translated (He et al. 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark 2009; Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg 1997).
For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
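A minimal sketch of this kind of computation is shown below (our illustration, not the tools' code; actual products differ in tokenisation and penalties, so the resulting percentages need not match the figures reported above).

```python
# Illustrative word- and character-level fuzzy scores for sentences (1) and
# (2), using the common 1 - edit_distance / max_length formula. Commercial
# tools use undisclosed variants, so exact percentages will differ.

def levenshtein(a, b):
    """Edit distance between two sequences (works on strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"

word_score = fuzzy(s1.split(), s2.split())                      # word-based
char_score = fuzzy(s1.replace(" ", ""), s2.replace(" ", ""))    # char-based
```

On these two sentences the sketch yields a word-based score of 0.60 and a character-based score of about 0.74, illustrating the same point as the paper's figures: the scores are low despite the sentences sharing a meaning.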
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs in terms of their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: they appear with a direct object either without a determiner (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. Similarly, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
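Read as plain text, each entry is a lemma followed by a category and a list of +property or +property=value annotations. A minimal parser sketch for this entry format (our illustration only, not part of NooJ):

```python
# Minimal parser for NooJ-style dictionary entries (illustrative sketch).
# Assumed format: lemma,CATEGORY followed by +prop or +prop=value items,
# e.g. "artist,N+FLX=TABLE+Hum".

def parse_entry(line):
    lemma, rest = line.split(",", 1)          # lemma before the comma
    category, *props = rest.split("+")        # category, then properties
    features = {}
    for prop in props:
        key, _, value = prop.partition("=")   # "FLX=TABLE" -> key/value
        features[key] = value or True         # bare "+Hum" -> boolean flag
    return {"lemma": lemma, "category": category, "features": features}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```

For the first entry in Figure 1 this yields lemma "artist", category N, inflectional paradigm TABLE and the semantic flag Hum, mirroring how the properties are described in the text.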
4.2 Paraphrasing the source translation units
The generation of the TM containing the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, comprises three stages.

Figure 2: Pipeline of the paraphrase framework

The first stage is the extraction of the source translation units from a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization; tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al. 2006), and the same tool is also used for de-tokenization in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and then paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
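Outside NooJ, the core of this rule can be approximated with a regular expression over a small hand-made mapping from predicate nouns to their derived verbs. The sketch below illustrates the pattern only: the support-verb list and noun-to-verb mapping are our own toy examples, and unlike the NooJ grammar it does not re-inflect the verb for tense and person.

```python
# Toy approximation of the SVC-paraphrasing rule: support verb + optional
# determiner + predicate noun (+ optional preposition) is replaced by the
# verb derived from the predicate noun. Hand-made lexicon; no re-inflection.
import re

NOMINALISATIONS = {"cancellation": "cancel",
                   "reservation": "reserve",
                   "installation": "install"}
SUPPORT_VERBS = ["make", "makes", "take", "takes", "have", "has"]

def paraphrase_svc(sentence):
    pattern = re.compile(
        r"\b(%s)\s+(?:a|an|the)?\s*(%s)\s*(?:of)?\b"
        % ("|".join(SUPPORT_VERBS), "|".join(NOMINALISATIONS)))
    # replace the whole SVC span with the verb derived from the noun
    return pattern.sub(lambda m: NOMINALISATIONS[m.group(2)], sentence)

out = paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information")
```

This turns sentence (1) into sentence (2), which is exactly the transformation the local grammar performs (with the added benefit, in NooJ, of preserving tense and person).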
The last stage comprises de-tokenization as well as the concatenation of the paraphrased translation units with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs with English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
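Assuming a TMX-style translation memory, the overall expansion step can be sketched as follows; `paraphrase()` is a placeholder for the NooJ stage, and the minimal TMX layout is our own assumption rather than the authors' exact file handling.

```python
# Sketch of the TM-expansion step: read a TMX-like file, paraphrase each
# English source segment, and append paraphrased variants as new translation
# units with the target side left unchanged. Minimal TMX layout assumed.
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def paraphrase(text):
    """Placeholder for the NooJ SVC-paraphrasing stage (toy rule)."""
    return text.replace("make the cancellation of", "cancel")

def expand_tm(tmx_string, src_lang="en"):
    root = ET.fromstring(tmx_string)
    body = root.find("body")
    new_units = []
    for tu in body.findall("tu"):
        src = next(t for t in tu.findall("tuv") if t.get(XML_LANG) == src_lang)
        para = paraphrase(src.find("seg").text)
        if para != src.find("seg").text:      # keep only genuine paraphrases
            clone = copy.deepcopy(tu)         # target tuv copied unchanged
            next(t for t in clone.findall("tuv")
                 if t.get(XML_LANG) == src_lang).find("seg").text = para
            new_units.append(clone)
    body.extend(new_units)                    # concatenate with original TM
    return root

tmx = """<tmx version="1.4"><body>
<tu><tuv xml:lang="en"><seg>Press Cancel to make the cancellation of your data</seg></tuv>
<tuv xml:lang="it"><seg>Premere Cancel per cancellare i propri dati</seg></tuv></tu>
</body></tmx>"""
expanded = expand_tm(tmx)
```

After expansion the memory holds both the original unit and its paraphrased twin, each pointing at the same Italian target, which is what allows a CAT tool to surface the human translation for either source wording.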
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve extremely high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM   parTM
100%                      14      48
95%-99%                   23      32
85%-94%                   18      29
75%-84%                   51      38
50%-74%                   32      18
No Match                  62      35
Total                    200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score of at least 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval but also ensures that the proposed translation is a human translation, so no machine translation errors will appear and less post-editing is required for matches below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not shared.
The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, applying a paraphrasing framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, Volume 9, ed. G. Aygen, C. Bowern and C. Quinn, pages 1–49. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving high translational quality. The evaluation of the methods was done on a medium-size, real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with an assumption similar to that of our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. "0-0 1-1" means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
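As an illustration, all three properties can be detected from the alignment points alone. The following sketch is ours, not the authors' code; the function name and output format are assumptions:

```python
def classify_alignment(points, src_len, trg_len):
    """Classify word-alignment points such as "0-0 1-1" (illustrative;
    not the paper's actual implementation)."""
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_aligned = {s for s, _ in pairs}
    trg_aligned = {t for _, t in pairs}
    return {
        # 1) empty alignment: some word has no counterpart at all
        "empty": len(src_aligned) < src_len or len(trg_aligned) < trg_len,
        # 2) one-to-many: one source word aligned to several target words
        "one_to_many": len(pairs) > len(src_aligned),
        # 3) opposite orientation: word order reversed in the translation
        "reordered": any(s1 < s2 and t1 > t2
                         for s1, t1 in pairs for s2, t2 in pairs),
    }

# "být větší" -> "be greater" with points 0-0 1-1: a clean 1:1 alignment
print(classify_alignment("0-0 1-1", 2, 2))
# → {'empty': False, 'one_to_many': False, 'reordered': False}
```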
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
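A simplified sketch of this joining logic (ours, not the authors' implementation): subsegments are represented as (start, end) token spans in the input segment and greedily merged, longest first, whenever they overlap (JOINO) or neighbour each other (JOINN):

```python
def join_subsegments(spans):
    """Greedy illustration of the JOIN method: spans are (start, end)
    token positions of subsegments located in the input segment.  The
    longest span is taken first, and every further span is merged into
    an existing joined span when it overlaps or is directly adjacent."""
    if not spans:
        return []
    pending = sorted(spans, key=lambda s: s[0] - s[1])  # longest first
    joined = [pending.pop(0)]                           # start with the biggest
    for s, e in pending:
        for i, (js, je) in enumerate(joined):
            if s <= je + 1 and js <= e + 1:             # overlap or neighbour
                joined[i] = (min(js, s), max(je, e))
                break
        else:
            joined.append((s, e))
    return sorted(joined)

# "we would like a table by the window": spans 0-3 and 3-5 overlap,
# span 6-7 neighbours the result, so everything joins into one span.
print(join_subsegments([(0, 3), (3, 5), (6, 7)]))  # → [(0, 7)]
```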
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 of MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT:
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University India1
Saarland University Germany2
German Research Center for Artificial Intelligence (DFKI) Germany3
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert a sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
"we","we",C,0 "want","",D,0 "to","would",S,0 "have","like",S,0 "a","a",C,0 "table","table",C,0 "near","by",S,0 "the","the",C,0 "window","window",C,0 ".",".",C,0

we want  to    have a table near the window .
|  D     S     S    | |     S    |   |      |
we -     would like a table by   the window .
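Such tercom-style alignment output can be consumed programmatically. Below is a small sketch; the parsing function is ours and the exact field layout handled here is an assumption based on the example above, as it can differ across tercom versions:

```python
import re

def parse_ter_alignment(raw):
    """Parse tercom-style alignment entries such as "we","would",S,0
    into (matched-segment word, input word, operation) triples."""
    return re.findall(r'"([^"]*)","([^"]*)",([CIDS])', raw)

raw = ('"we","we",C,0 "want","",D,0 "to","would",S,0 "have","like",S,0 '
       '"a","a",C,0 "table","table",C,0 "near","by",S,0 '
       '"the","the",C,0 "window","window",C,0')
ops = parse_ter_alignment(raw)
# Every non-C operation is a post-editing action the translator must make.
edits = sum(1 for _, _, op in ops if op != "C")
print(edits, "edits over", len(ops), "alignment links")  # → 4 edits over 9 alignment links
```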
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer TM segment 1 over TM segment 2 for the test segment shown above.
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts of each reference translation: green portions are matched fragments, and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and its translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।

• Alignment: NULL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
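Combining the two sets of alignments can be sketched as below. The data structures, the 0-based indices and the romanised Bengali tokens are illustrative assumptions, not CATaLog's internal representation; the example encodes the "we want to have a table near the window" sentence pair and its GIZA++ alignment shown above.

```python
# Minimal sketch: project TER matches onto the target side through the
# source-target word alignment to color both sentences.

def color_tokens(tm_src, tm_tgt, matched_src, src2tgt):
    """matched_src: TM source token indices that TER marked as matches (C);
    src2tgt: GIZA++-style word alignment, source index -> target indices."""
    src_colors = ["green" if i in matched_src else "red"
                  for i in range(len(tm_src))]
    tgt_colors = ["red"] * len(tm_tgt)
    for s in matched_src:                  # project matches to the target
        for t in src2tgt.get(s, []):
            tgt_colors[t] = "green"
    return src_colors, tgt_colors

tm_src = "we want to have a table near the window".split()
tm_tgt = "amra janalar kachhe ekta tebil chai".split()  # romanised Bengali
matched = {0, 4, 5, 7, 8}        # we, a, table, the, window (TER tag C)
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}
src_colors, tgt_colors = color_tokens(tm_src, tm_tgt, matched, src2tgt)
print(tgt_colors)  # amra, janalar, ekta, tebil in green; kachhe, chai in red
```

Note how a matched source token ("window") turns its aligned target token ("janalar") green even though the two sides order the words very differently, which is exactly why both alignments are needed.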
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may come out nearly 100% green, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches:
1. আপনি আমাকে ভুল খুচরো দিয়েছেন। আমি আশি ডলার দিয়েছি। (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন। (Gloss: amar dharona apni vul nombore phon korechen)

3. আপনি ভুল। (Gloss: apni vul)

4. আপনি আমাকে টাকা দিন। (Gloss: apni amake taka din)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন। (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system displays the 5 topmost TM matches, color coded as described, in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) takes a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that only when a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
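The posting-list filter described above can be sketched as follows. The stop-word list and the whitespace tokenisation are illustrative assumptions; the paper does not specify which tokens CATaLog discards.

```python
# Minimal sketch of an inverted index over TM source sentences and the
# candidate filter built on top of it.
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "by", "it", "in"}  # illustrative

def build_posting_lists(tm_sources):
    """Map each (lowercased, non-stop-word) surface form to the ids of the
    TM sentences containing it. No stemming: surface forms only."""
    postings = defaultdict(set)
    for sent_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sent_id)
    return postings

def candidate_sentences(input_sentence, postings):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the (expensive) TER scoring."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= postings.get(word, set())
    return candidates

tm = ["we want to have a table near the window",
      "how much does it cost to the holiday inn by taxi",
      "you pay me"]
postings = build_posting_lists(tm)
print(candidate_sentences("we would like a table by the window", postings))  # {0}
```

Only sentence 0 shares content words ("we", "table", "window") with the input, so the other TM segments never reach the TER scorer.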
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme.
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11-12 illustrate this kind of segment.
(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.
(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
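The extraction just described can be sketched with regular expressions. This is a simplified illustration, not the production script: it assumes paired tags of the form <1>...</1> and straight double quotes, both approximations of the real data.

```python
# Minimal sketch of generating new bi-segments from quotations and tags.
import re

TAG = re.compile(r"<(\d+)>(.*?)</\1>")   # assumed <1>...</1> paired tags
QUOTE = re.compile(r'"([^"]*)"')         # assumed straight double quotes

def strip_markup(text):
    """The sentence with quotation marks and tags removed, text kept."""
    return QUOTE.sub(r"\1", TAG.sub(r"\2", text))

def new_segments(src, tgt):
    """The extra bi-segments: the stripped sentence pair, then the fragments
    inside tags (paired by tag id) and inside quotes (paired by order)."""
    if not (TAG.search(src) or QUOTE.search(src)):
        return []
    segments = [(strip_markup(src), strip_markup(tgt))]
    src_tags, tgt_tags = dict(TAG.findall(src)), dict(TAG.findall(tgt))
    for tag_id, fragment in src_tags.items():
        if tag_id in tgt_tags:
            segments.append((fragment, tgt_tags[tag_id]))
    segments.extend(zip(QUOTE.findall(src), QUOTE.findall(tgt)))
    return segments

src = 'Clear the <1>Inherit settings</1> check box in the <2>General</2> section'
tgt = 'Anule la casilla <1>Heredar configuracion</1> en la seccion <2>General</2>'
for pair in new_segments(src, tgt):
    print(pair)
```

Pairing tagged fragments by tag id is what keeps the bilingual fragments aligned even when the two languages order the tags differently.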
5 Pilot test
Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.
5.1 Test set

We used as a test set a real translation project coming from one of our clients: a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.
The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.
TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.
Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) included all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client and was thus the biggest one. Table 2 summarises the size of the three TMs.
              Segments   Words EN   Words ES     Total
Project TM       16842     212472     244159    456631
Product TM       20923     274542     317797    592339
Client TM       256099    3427861    3951732   7379593

Table 2: Size of the different TMs used.
We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.
Table 4 breaks down the number of segments generated using each strategy and for each TM used.
                          Segments   Words EN   Words ES     Total
Project TM new segments       6776      56973      66297    123270
Product TM new segments       7760      71034      83125    154159
Client TM new segments       74041     662714     769705   1432419

Table 3: Size of the new TMs generated using our approach.
As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.
Type              TM Proj   TM Prod   TM Client
Ellipsis                7         6          50
Colon                1801      2094       17894
Parenthesis          1361      1621       29637
Square brackets        78        73         598
Curly braces            0         0           0
Quotations           1085      1146        8797
Tags                 2523      2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or when combining the three new TMs (TM11).
             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.
           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable even though the rest of the sentence is not; if this was proven true, a productivity increase may be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory: perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.
Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.
Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained with the first method is high, and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or contain low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
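As a rough illustration of the simplest signal such a classifier can use, the sketch below scores a bi-segment with a Gale-Church-style character-length statistic (the mean ratio and variance values come from Gale and Church's 1993 length model); the example pairs and the interpretation of the score are illustrative assumptions, not the paper's actual feature set.

```python
# Minimal sketch of a Church-Gale-style length score for a bi-segment.
import math

# Gale & Church (1993) character-length model: mean ratio c, variance s2
C, S2 = 1.0, 6.8

def church_gale_score(source, target):
    """Standardised difference between the observed target length and the
    length expected from the source; values far from 0 are suspicious."""
    l1, l2 = len(source), len(target)
    if l1 == 0:
        return float("inf")
    return (l2 - C * l1) / math.sqrt(S2 * l1)

# A plausible translation pair scores close to 0 ...
good = church_gale_score("How are you?", "Come stai?")
# ... while an unrelated "chat" answer drifts much further away from 0.
bad = church_gale_score("How are you?",
                        "Questo testo non ha niente a che fare con la frase")
print(abs(good) < abs(bad))  # True
```

Length alone obviously cannot catch every error type listed in Section 3, but scores far from 0 single out candidate negative examples cheaply, which is how the paper's abstract describes using them.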
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements of translation memory systems that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention ApSIC Xbench2. ApSIC Xbench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g., tags not closed), or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in that approach are different from those in ours.
² http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors arise in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (1993): in a parallel corpus, the corresponding segments have roughly the same length³. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / √(3.4 (ls + ld))    (1)
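As a minimal sketch, equation 1 can be computed as follows; treating the segment lengths as character counts is our assumption, following the formulation in Tiedemann (2011):

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length-difference score (equation 1).

    ls, ld: lengths of the source and destination segments,
    assumed here to be character counts (Tiedemann, 2011).
    """
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

balanced = church_gale(50, 48)   # similar lengths -> score near 0
partial = church_gale(120, 15)   # a partial translation -> score far from 0
```

Bi-segments that are true translations cluster around zero, while errors such as partial translations land in the tails of the distribution.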
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat⁴. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the
³ This simple idea is implemented in many sentence aligners.
⁴ Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale Scores in the Matecat Set
Figure 3: The distribution of Church-Gale Scores in the Collaborative Set
Figure 4: The normal curve added to the distribution of Church-Gale Scores in the Collaborative Set
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
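The tail-biased sampling step can be sketched as below. Weighting each bi-segment by the absolute value of its Church-Gale score is our own illustrative choice; the paper only states that sampling favoured scores away from the center:

```python
import random

def sample_tail_biased(bisegments, scores, k, rng=random.Random(0)):
    """Draw k bi-segments with replacement, favouring those whose
    Church-Gale score lies far from the centre of the distribution.

    Weighting by |score| is an illustrative assumption, not the
    paper's exact procedure.
    """
    weights = [abs(s) + 1e-6 for s in scores]  # epsilon avoids all-zero weights
    return rng.choices(bisegments, weights=weights, k=k)

segments = ["seg%d" % i for i in range(100)]
scores = [i - 50 for i in range(100)]  # toy scores from -50 to 49
sample = sample_tail_biased(segments, scores, k=20)
```

Candidates drawn this way still need the manual validation pass described above before they can serve as negative examples.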
• Training Set: it contains 1243 bi-segments, of which 373 are negative examples.
• Test Set: it contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat set
Figure 6: The Q-Q plot for the Collaborative set
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg_score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has_url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.
• has_tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.
• has_email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.
• has_number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has_capital_letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.
• has_words_capital_letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation_similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag_similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and target segments).
• email_similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.
• url_similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL-address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.
• number_similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.
• bisegment_similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital_letters_word_difference. The value of the feature is the ratio between the difference of the numbers of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.
• only_capletters_dif. The value of the feature is the ratio between the difference of the numbers of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.
• lang_dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
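The lang_dif feature reduces to counting mismatches between declared and detected codes. A minimal sketch (the helper is hypothetical; in the paper the detected codes come from the Cybozu language detector):

```python
def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language
    codes for the (source, target) pair, as in the paper's examples."""
    return sum(1 for a, b in zip(declared, detected) if a != b)

zero = lang_dif(("en", "it"), ("en", "it"))  # en-en/it-it
one = lang_dif(("en", "it"), ("en", "fr"))   # en-en/it-fr
two = lang_dif(("en", "it"), ("de", "fr"))   # en-de/it-fr
```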
All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
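The cosine-over-punctuation-vectors computation behind punctuation_similarity can be sketched as follows; the punctuation inventory PUNCT is our own assumption:

```python
import math
from collections import Counter

PUNCT = set(".,;:!?()\"'")  # assumed punctuation inventory

def punct_vector(segment):
    """Count each punctuation mark occurring in the segment."""
    return Counter(ch for ch in segment if ch in PUNCT)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

src = "Hello, world! How are you?"
tgt = "Ciao, mondo! Come stai?"
sim = cosine(punct_vector(src), punct_vector(tgt))  # identical punctuation -> 1.0
```

The same cosine helper applies unchanged to the tag, email, URL and number vectors.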
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu⁵.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random Forests have a lower probability of overfitting the data than Decision Trees.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can
⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problemswhere the decision boundary is irregular
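To illustrate the distance-based prediction that K-Nearest Neighbors performs, here is a minimal pure-Python sketch; the paper itself uses the scikit-learn implementations, and the toy feature vectors below are hypothetical:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training
    instances, using Euclidean distance on the feature vectors."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy feature vectors: (normalised cg_score, bisegment_similarity)
train = [((0.10, 0.90), "good"), ((0.20, 0.80), "good"),
         ((0.15, 0.85), "good"), ((0.90, 0.10), "bad"),
         ((0.80, 0.20), "bad")]
```

For instance, `knn_predict(train, (0.12, 0.88))` lands among the "good" examples, while `knn_predict(train, (0.85, 0.15))` lands among the "bad" ones.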
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased even if the recall drops. We make some suggestions in the next section.
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
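The precision-oriented filtering suggested above can be sketched as a simple threshold on the classifier's probability output; the threshold value and the P(negative) scores below are illustrative assumptions:

```python
def high_precision_filter(bisegments, proba, threshold=0.9):
    """Remove only bi-segments classified as negative (bad) with
    probability >= threshold; keep the rest in the TM.

    `proba` maps each bi-segment to P(negative), e.g. taken from
    LogisticRegression.predict_proba. The 0.9 default mirrors the
    paper's aim of >90% precision on the removed subset.
    """
    remove = [b for b in bisegments if proba[b] >= threshold]
    keep = [b for b in bisegments if proba[b] < threshold]
    return remove, keep

proba = {"b1": 0.97, "b2": 0.55, "b3": 0.91, "b4": 0.20}
remove, keep = high_precision_filter(list(proba), proba)
```

Raising the threshold trades recall for precision, which is exactly the direction the cleaning task calls for.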
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
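As a rough sketch of the sentence-level matching these tools perform, a Levenshtein-based fuzzy score can be computed as below; the exact scoring formulas of commercial tools are proprietary, so the normalisation here is illustrative:

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert/delete/substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_score(query, segment):
    """Similarity in [0, 1]; a stand-in for a TM tool's percentage match."""
    longest = max(len(query), len(segment)) or 1
    return 1.0 - levenshtein(query, segment) / longest
```

On such a score, a query that covers only one clause of a long stored sentence falls below typical fuzzy-match thresholds even when the clause itself matches exactly.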
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
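The effect can be illustrated with difflib's ratio as a stand-in for a TM tool's fuzzy score: a query clause scores far higher against a stored clause than against the full sentence containing it. The segments below are adapted from the paper's Set A example; the scoring function is illustrative, not any tool's actual algorithm:

```python
import difflib

def match_score(a, b):
    """Approximate similarity; a stand-in for a TM tool's fuzzy score."""
    return difflib.SequenceMatcher(None, a, b).ratio()

full = ("the ministry of defense once indicated that about 20000 "
        "soldiers were missing in the korean war and that the ministry "
        "of defense believes there may still be some survivors")
clause = "and that the ministry of defense believes"
query = "the ministry of defense believes"

s_full = match_score(query, full)      # low: the clause is a small part of the sentence
s_clause = match_score(query, clause)  # high: close to a full match
```

With the full sentence stored as one segment the score stays below typical fuzzy-match thresholds, whereas the clause-level segment clears them comfortably.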
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a
group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
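The Set A protocol above can be illustrated with a toy retrieval check. Here `difflib.SequenceMatcher` stands in for a TM tool's proprietary fuzzy scorer, and the clause boundaries are given by hand; in the paper a clause splitter produces them automatically.

```python
from difflib import SequenceMatcher

def score(a, b):
    # stand-in for a TM tool's fuzzy match score
    return SequenceMatcher(None, a, b).ratio()

# Without clause splitting the TM holds the full sentence ...
full = ("the ministry of defense once indicated that about 20000 soldiers "
        "were missing in the korean war and that the ministry of defense "
        "believes there may still be some survivors")
# ... with clause splitting it holds the component clauses.
clauses = ["the ministry of defense once indicated",
           "that about 20000 soldiers were missing in the korean war",
           "and that the ministry of defense believes",
           "there may still be some survivors"]
inp = "that about 20000 soldiers were missing in the korean war"

without = score(inp, full)                        # input vs. whole sentence
with_split = max(score(inp, c) for c in clauses)  # input vs. best clause
assert with_split > without                       # splitting raises the match
```

The input clause matches one stored clause exactly, while against the full sentence its score is dragged down by the unmatched material.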
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example.

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example.
5 Results
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically to the clause splitter failing to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. Although the percentages of retrieved matches in this set are generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all although it has more than one clause, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
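The averaging-and-test procedure can be sketched with the standard library alone (the paper used SPSS); the scores below are made up for illustration, and `collapse`/`paired_t` are hypothetical helper names.

```python
import math
from statistics import mean, stdev

def collapse(clause_scores):
    """Average the clause-level scores of one original segment into one case."""
    return mean(clause_scores)

def paired_t(xs, ys):
    """t statistic for paired samples xs, ys of equal length."""
    diffs = [y - x for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

without = [0.0, 20.0, 10.0, 0.0, 15.0]           # scores without clause splitting
with_cs = [collapse([80.0, 90.0]), 95.0, 85.0,   # multi-clause inputs averaged
           collapse([70.0, 100.0]), 90.0]
t = paired_t(without, with_cs)                   # large positive t: splitting helps
```

A missing or wrong retrieval contributes 0 on the "without" side, which is exactly how the paper scores unretrieved segments.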
SET A

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold         19.23              85.30888891      0.000
Minimum threshold         29.28              86.48888891      0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold         11.78              83.87777781      0.000
Minimum threshold         11.78              88.07111113      0.000

SET B

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold          2.23              17.21111111      0.000
Minimum threshold          6.49              30.62111112      0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold          2.71              23.083           0.000
Minimum threshold         16.83              46.36777779      0.000

Table 5: Paired t-test on all experiments.
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of it.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment so that, on the target side, what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which prevents the retrieval of matches that are worded differently but similar in meaning. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, the translation quality is more consistent, and the workflow is more efficient.
The fuzzy match level refers to the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, which can confuse the translator.
This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. These techniques, however, only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy match scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
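The effect described above can be reproduced with a plain word-level Levenshtein score over sentences (1), (2) and (4). The exact percentages depend on each tool's undisclosed formula, so this sketch asserts only the ordering: the paraphrased source unit scores far higher against the query than the original SVC wording does.

```python
def dist(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def score(a, b):
    a, b = a.split(), b.split()
    return 1 - dist(a, b) / max(len(a), len(b))

svc   = "press cancel to make the cancellation of your personal information"  # (1)
para  = "press cancel to cancel your personal information"                    # (2)
query = "press cancel to cancel your booking information"                     # (4)

# The paraphrased source unit yields a much higher match for the query,
# while its stored target translation stays untouched.
assert score(query, para) > score(query, svc)
```

Under this normalisation the query scores 1 - 1/7 against (2) but only 1 - 5/10 against (1).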
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptions, they appear either without a direct object (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent, sentence (3), included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities, unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
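Entries in the style of Figure 1 can be read mechanically: a lemma and category, then `+key=value` or `+flag` properties. The field layout follows NooJ's general dictionary format, but this small parser is a hypothetical helper, not part of NooJ itself.

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line into lemma, category and properties."""
    head, *props = line.split("+")
    lemma, category = head.split(",")
    info = {"lemma": lemma, "category": category}
    for p in props:
        key, _, value = p.partition("=")
        info[key] = value if value else True  # bare flags become booleans
    return info

entry = parse_entry("artist,N+FLX=TABLE+Hum")
# entry -> {"lemma": "artist", "category": "N", "FLX": "TABLE", "Hum": True}
```

The same reader handles inflectional paradigms (`FLX=TABLE`) and semantic flags (`Hum`, `Conc`) uniformly.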
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process; tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on the specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
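A rough stand-in for what the Figure 3 grammar does can be written with a regular expression, assuming a tiny hand-made nominalization lexicon; NooJ's actual resources are far larger, and it also regenerates the full verb in the matching tense and person, which this base-form-only sketch omits.

```python
import re

# predicate noun -> full verb (toy lexicon; NooJ derives this from its dictionaries)
NOMINALISATIONS = {"cancellation": "cancel",
                   "reservation": "reserve",
                   "installation": "install"}

SUPPORT_VERBS = r"(?:make|makes|made|take|takes|took|give|gives|gave)"

# support verb + optional determiner + noun + optional trailing "of"
PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS + r"\s+(?:(?:a|an|the)\s+)?(\w+)(?:\s+of)?",
    re.IGNORECASE)

def paraphrase_svc(sentence):
    def repl(m):
        noun = m.group(1).lower()
        # collapse the SVC to the full verb; leave non-SVC matches untouched
        return NOMINALISATIONS.get(noun, m.group(0))
    return PATTERN.sub(repl, sentence)

print(paraphrase_svc("press cancel to make the cancellation of your personal information"))
```

Here "make the cancellation of" collapses to "cancel", while combinations whose noun is not a known nominalization pass through unchanged.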
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in all CAT tools in the same way as before. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
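The overall expansion can be summarised as follows: every (source, target) unit whose source contains an SVC contributes one extra unit whose source is the paraphrase and whose target is copied verbatim. `paraphrase` here is a one-rule placeholder for the NooJ pipeline described above.

```python
def paraphrase(src):
    # placeholder for the NooJ SVC paraphraser (single hard-coded rule)
    return src.replace("make the cancellation of", "cancel")

def expand_tm(units):
    """Return the original TM plus paraphrased copies of SVC-bearing units."""
    extra = []
    for src, tgt in units:
        para = paraphrase(src)
        if para != src:               # an SVC was recognised and rewritten
            extra.append((para, tgt))  # the target side stays "as is"
    return units + extra

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali"),
      ("Close the window", "Chiudere la finestra")]
expanded = expand_tm(tm)  # 2 original units + 1 paraphrased unit
```

This mirrors the paper's numbers in miniature: 1025 original units plus 587 recognised SVCs gave a 1612-unit parTM.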
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category and, finally, 27% in the 0%-74% category (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14      48
95%-99%                   23      32
85%-94%                   18      29
75%-84%                   51      38
50%-74%                   32      18
No Match                  62      35
Total                    200     200

Table 1: Statistics for the experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine translation errors will appear and less post-editing is required for matches below 100%.
Alongside these gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.
ture projects Seg
TMsl TMtg
Make sure that the brake hose is not twisted
Ensure that the brake hose is not twisted Assicurarsi che il tubo flessibile freni non sia at-
torcigliato
28
parTMsl
Make sure that the brake hose is not twisted
Seg
TMsl
TMtg
parTMsl
CAUTION You must make the istallation of the
version 6 of the software
CAUTION You must install the version 6 of the
software
ATTENZIONE Si deve installare la versione 6 del software
CAUTION You must make the istallation of the version 6 of the software
Figure 4 Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.
The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs, as well as support for other languages. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even further.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
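As an illustration of this selection step, the best candidate could be ranked by the product of the four scores. This is a hypothetical sketch, not the authors' actual code; the combination function (a plain product rather than a weighted log-linear sum) is our assumption:

```python
# Hypothetical sketch: rank alternative translations of a subsegment by the
# product of the four Moses phrase-table scores. The field layout follows
# Figure 1: target phrase plus four probabilities.
from math import prod

def best_translation(candidates):
    """candidates: list of (target, [p_inv_phrase, p_inv_lex, p_dir_phrase, p_dir_lex])."""
    return max(candidates, key=lambda c: prod(c[1]))

subs = [
    ("be greater", [0.538, 0.053, 0.538, 0.136]),  # scores from Figure 1
    ("be larger",  [0.170, 0.054, 0.019, 0.148]),
]
print(best_translation(subs)[0])  # -> be greater
```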
The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
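These three cases can be read off the alignment points mechanically. The following helper is a hypothetical sketch (the function name and output format are ours, not the paper's):

```python
# Hypothetical helper: parse Moses-style alignment points ("0-0 1-1") and
# flag the three cases: 1) empty alignment (an uncovered source word),
# 2) one-to-many alignment, 3) opposite orientation (crossing links).
def analyse_alignment(points, src_len):
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    covered = {s for s, _ in pairs}
    return {
        "empty": [i for i in range(src_len) if i not in covered],
        "one_to_many": sorted({s for s, _ in pairs
                               if sum(1 for s2, _ in pairs if s2 == s) > 1}),
        "reordered": any(s1 < s2 and t1 > t2
                         for s1, t1 in pairs for s2, t2 in pairs),
    }

print(analyse_alignment("0-0 1-1", 2))
# -> {'empty': [], 'one_to_many': [], 'reordered': False}
```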
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
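A minimal sketch of this greedy loop follows, with subsegments as (start, end) token index spans; the de-duplication via a `seen` set is our addition to guarantee termination and may differ from the paper's actual bookkeeping:

```python
# Simplified sketch of the greedy JOIN loop: spans are (start, end) token
# index pairs inside one tokenized segment; two spans can be joined when
# they overlap (JOINO) or are adjacent (JOINN) in the segment.
def joinable(a, b):
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1          # overlap or adjacency

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def join_subsegments(spans):
    # longest-first processing order, as described in the text
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    results = []
    while I:
        head, I = I[0], I[1:]
        T = [join(head, s) for s in I
             if joinable(head, s) and join(head, s) not in seen]
        seen.update(T)
        results.extend(T)
        # new (longer) subsegments are preferred in later iterations
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I
    return results

print(join_subsegments([(0, 3), (4, 6), (8, 9)]))  # -> [(0, 6)]
```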
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

Company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
account just the segments that are pre-translated both by MemoQ Fragment assembly and by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to the translation quality; however, in this case we have measured just the fully (100%) matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filters for removing phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators employ on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper, we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TMs, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to see at a glance how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate): how much editing is required to convert one sentence into another, relative to the length of the first sentence. The allowable edit operations are insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0)
we want  to   have  a table near the window
|   D     S    S    |   |    S    |    |
we  -  would like  a table  by  the window
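The alignment string above can be turned into operation triples with a small parser. This is a sketch based on the format as printed here; the exact formatting of tercom's output may differ slightly.

```python
import re

# Alignment string in the form shown in the paper's example.
ALIGNMENT = ('"we","we",C,0 "want","",D,0 "to","would",S,0 "have","like",S,0 '
             '"a","a",C,0 "table","table",C,0 "near","by",S,0 "the","the",C,0 '
             '"window","window",C,0 ".",".",C,0')

def parse_ter_alignment(s):
    """Return a list of (ref_word, hyp_word, op) triples, op in C/I/D/S."""
    return [(r, h, op)
            for r, h, op in re.findall(r'"([^"]*)","([^"]*)",([CIDS]),\d+', s)]

ops = parse_ter_alignment(ALIGNMENT)
matched = [r for r, h, op in ops if op == "C"]  # words to color green
```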
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
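The effect of unequal edit costs can be sketched as follows. The concrete weight values here are assumptions for illustration; the paper states only that deletions receive a very low weight and the other operations equal weights.

```python
# Illustrative weighted scoring over TER edit operations. With equal
# weights, TM segment 2 (3 insertions) scores better than TM segment 1
# (6 deletions); with a low deletion weight the ranking flips.
COSTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}  # assumed values

def weighted_cost(ops):
    return sum(COSTS[op] for op in ops)

def equal_cost(ops):                 # plain TER: every edit costs 1
    return sum(op != "C" for op in ops)

# Test segment "how much does it cost" against the two TM segments:
seg1_ops = ["C"] * 5 + ["D"] * 6     # TM 1: delete "to the holiday inn by taxi"
seg2_ops = ["C"] * 2 + ["I"] * 3     # TM 2: insert "does it cost"

assert equal_cost(seg1_ops) > equal_cost(seg2_ops)        # plain TER prefers TM 2
assert weighted_cost(seg1_ops) < weighted_cost(seg2_ops)  # weighted cost prefers TM 1
```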
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: green portions are matched fragments, while red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
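A GIZA++ alignment line of this form can be read into word-position pairs with a few lines of code. This sketch assumes the standard `word ({ indices })` output format shown above.

```python
import re

# One line of GIZA++ word alignment: each source token is followed by the
# (1-based) target positions it is aligned to; NULL collects unaligned
# target words.
LINE = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) "
        "table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")

def parse_giza_line(line):
    """Return a list of (source_word, [target_positions]) pairs."""
    found = re.findall(r'(\S+) \(\{([\d ]*)\}\)', line)
    return [(word, [int(i) for i in idx.split()]) for word, idx in found]

pairs = parse_giza_line(LINE)
```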
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
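Combining the two alignment sets can be sketched as follows; the data and function names are illustrative, not the tool's actual code. A target word is colored green when it is aligned to a source word that TER marked as a match.

```python
# Propagate source-side TER matches to target positions through the word
# alignment. Source positions are 0-based, target positions 1-based, as in
# the GIZA++ example above.

def color_target(matched_src, alignment, target_len):
    green = set()
    for src_pos, tgt_positions in alignment:
        if src_pos in matched_src:
            green.update(tgt_positions)
    return ["green" if i + 1 in green else "red" for i in range(target_len)]

# Word alignment of "we want to have a table near the window ." to its
# 7-token Bengali translation (from the GIZA++ example):
alignment = [(0, [1]), (1, [6]), (4, [4]), (5, [5]), (6, [3]), (8, [2]), (9, [7])]
# TER marked "we", "a", "table", "the", "window" and "." as matches (C):
matched = {0, 4, 5, 7, 8, 9}
colors = color_target(matched, alignment, 7)
```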
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation for the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists; this feature allows comparing results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
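The posting-list lookup described above can be sketched as follows. The stop-word list and the whitespace tokenization are simplified assumptions for illustration.

```python
from collections import defaultdict

STOP = {"the", "a", "to", "we", "i", "you"}  # assumed stop-word list

def build_index(tm_sources):
    """Map each (lowercased, non-stop) surface form to the set of TM
    sentence ids containing it. No stemming: surface forms only."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP:
                index[word].add(sent_id)
    return index

def candidates(index, query):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the TER similarity computation."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["you gave me the wrong change", "you are wrong", "you pay me"]
index = build_index(tm)
cands = candidates(index, "you gave me wrong number")  # -> all three sentences
```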
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.
Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.
Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.
J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.
Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)
Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.
Spence Green Jeffrey Heer and Christopher DManning 2013 The efficacy of human post-editing for language translation In Proceedingsof the SIGCHI Conference on Human Factorsin Computing Systems
Spence Green 2014 Mixed-initiative natural lan-guage translation PhD thesis Stanford Uni-versity
Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan 2014 In-corporating Paraphrasing in Translation Mem-ory Matching and Retrieval In Proceedings ofEAMT
Rohit Gupta Constantin Orăsan MarcosZampieri Mihaela Vela and Josef van Gen-abith 2015 Can Translation Memories affordnot to use paraphrasing In Proceedings ofEAMT
Yifan He Yanjun Ma Josef van Genabith andAndy Way 2010 Bridging SMT and TM withtranslation recommendation In Proceedings ofACL pages 622ndash630
Panagiotis Kanavos and Dimitrios Kartsaklis2010 Integrating Machine Translation withTranslation Memory A Practical Approach InProceedings of the Second Joint EM+CNGLWorkshop Bringing MT to the User Researchon Integrating MT in the Translation Industry
Philipp Koehn and Jean Senellart 2010 Con-vergence of translation memory and statisticalmachine translation In Proceedings of AMTAWorkshop on MT Research and the TranslationIndustry pages 21ndash31
Elina Lagoudaki 2008 The value of machinetranslation for the professional translator InProceedings of the 8th Conference of the Associ-ation for Machine Translation in the Americaspages 262ndash269 Waikiki Hawaii
Percy Liang Ben Taskar and Dan Klein 2006Alignment by Agreement In Proceedings of theMain Conference on Human Language Technol-ogy Conference of the North American Chapterof the Association of Computational Linguisticspages 104ndash111
Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127
Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009
Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium
Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331
Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)
Author Index
Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
                 Segments       EN        ES
Project TM           6776    56973     66297
  new segments     123270
Product TM           7760    71034     83125
  new segments     154159
Client TM           74041   662714    769705
  new segments    1432419

Table 3: Size of the new TMs generated using our approach.
used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client TM, the text between parentheses was more prolific.
                 TM Proj   TM Prod   TM client
Ellipsis               7         6          50
Colon               1801      2094       17894
Parenthesis         1361      1621       29637
Square brackets       78        73         598
Curly braces           0         0           0
Quotations          1085      1146        8797
Tags                2523      2892       17792

Table 4: Number of newly generated segments per type and TM used.
We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:
1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.
The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.
As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).
However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.
When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started   234  204  150  202  143  143   38   27   26    26    26
Pre-trans     155  111  178  108  179  183  139  199  205   201   205
Fragments      36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

              TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%            0    5    5    5    5    5    5    5    5     5     5
95-99%          2    3    3    4    4    4    9    9    9     9     9
85-94%          0    1    1    1    1    1    2    2    2     2     2
75-84%         14    9   14    9   15   15   12   16   16    16    16
50-74%        187  130  198  136  202  202  180  223  226   226   226
No match      142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
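A naive extractor in this spirit might look as follows. The delimiter set and regular expressions are our assumptions for illustration; the authors' actual script is not given in the paper.

```python
import re

# Candidate patterns mirroring some of the segment types in Table 4:
# text between parentheses, between quotation marks, surrounded by tags,
# and following a colon. Square/curly brackets and ellipses would follow
# the same scheme.
PATTERNS = [
    r"\(([^()]+)\)",              # parentheses
    r"\"([^\"]+)\"",              # quotation marks
    r"<[^>]+>([^<>]+)</[^>]+>",   # text surrounded by tags
    r":\s*([^:]+)$",              # text after a colon
]

def extract_fragments(segment):
    """Return sub-segment candidates found inside one TM segment."""
    fragments = []
    for pat in PATTERNS:
        fragments.extend(m.strip() for m in re.findall(pat, segment))
    return fragments

frags = extract_fragments('Error: unable to open <b>file name</b> ("disk full")')
```

Run in parallel over the source and target side of a bi-segment, fragments extracted by the same pattern can then be paired to populate a new TM.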
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet more segments by combining the retrieved fragments, and to pre-translate files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than when translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).
Rohit Gupta Constantin Orasan Marcos ZampieriMihaela Vela and Josef van Genabith 2015 CanTranslation Memories afford not to use paraphras-ing In Proceedings of the 18th Annual Conferenceof the European Association for Machine Transla-tion Antalya (Turkey) May EAMT
Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.
Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.
Viktor Pekar and Ruslan Mitkov 2007 NewGeneration Translation Memory Content-SensitiveMatching In Proceedings of the 40th AnniversaryCongress of the Swiss Association of TranslatorsTerminologists and Interpreters
Michel Simard and Atsushi Fujita 2012 A PoorManrsquos Translation Memory Using Machine Transla-tion Evaluation Metrics In Proceedings of the TenthConference of the Association for Machine Transla-tion in the Americas
Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained using the first method is high, and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory.
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or contain low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While for improving the accuracy of retrieved segments there is a fair amount of work (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g., tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is partially translated in Italian as "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (1993): in a parallel corpus, corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.
$CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}}$   (1)
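As a sketch, equation 1 can be computed directly from the segment lengths. The function name below is ours, and we assume character-level lengths, since the paper does not state the unit:

```python
import math

def church_gale_score(source: str, target: str) -> float:
    # Modified Church-Gale length difference (Tiedemann, 2011):
    # CG = (l_s - l_d) / sqrt(3.4 * (l_s + l_d))
    # Character lengths are an assumption; the unit is not specified.
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Equal-length segments score 0; a much longer source scores high.
print(church_gale_score("ciao", "ciao"))  # 0.0
```

True translations should cluster near zero; segments with very different lengths land in the tails of the distribution.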
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
[Figure 2: The distribution of Church-Gale scores in the Matecat Set. x axis: Church Gale Score (−4 to 4); y axis: probability.]
[Figure 3: The distribution of Church-Gale scores in the Collaborative Set. x axis: Church Gale Score (−100 to 50); y axis: probability.]
[Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set. x axis: Church Gale Score (−100 to 50); y axis: probability.]
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores are away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:
• Training Set. It contains 1243 bi-segments, 373 of which are negative examples.
• Test Set. It contains 309 bi-segments, 87 of which are negative examples.
The proportion of negative examples in both sets is approximately 30%.
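The tail-biased draw from the Collaborative Set can be sketched as follows. This is a minimal illustration: the paper only states that sampling is biased towards scores away from the center, so the exact weighting scheme here is our assumption:

```python
import random

def sample_tail_biased(bisegments, cg_scores, k, seed=0):
    # Weight each bi-segment by the absolute value of its Church-Gale
    # score, so extreme (suspicious) scores are drawn more often.
    # The epsilon keeps perfectly centered segments reachable.
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in cg_scores]
    return rng.choices(bisegments, weights=weights, k=k)

segments = ["centered", "slightly off", "far off"]
scores = [0.1, 2.0, 50.0]
draws = sample_tail_biased(segments, scores, k=1000)
print(draws.count("far off") > draws.count("centered"))  # True
```

The drawn candidates would then still be validated manually, as described above.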
[Figure 5: The Q-Q plot for the Matecat Set. x axis: theoretical quantiles; y axis: sample quantiles.]
[Figure 6: The Q-Q plot for the Collaborative Set. x axis: theoretical quantiles; y axis: sample quantiles.]
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, and 0 otherwise.
• has tag. The value of the feature is 1 if the source or target segments contain a tag, and 0 otherwise.
• has email. The value of the feature is 1 if the source or target segments contain an email address, and 0 otherwise.
• has number. The value of the feature is 1 if the source or target segments contain a number, and 0 otherwise.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, and 0 otherwise.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, and 0 otherwise. Unlike the previous feature, this one is activated only when there are whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and target segments).
• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL vectors. The reasoning for introducing this feature is the same as for tag similarity.
• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capitalized words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of all-capital words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
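A few of the features above can be sketched as follows. This is a simplified illustration: the exact punctuation inventory, regular expressions and helper names are our assumptions, not the paper's implementation:

```python
import re
from collections import Counter
from math import sqrt

PUNCT = list(".,;:!?()\"'")

def cosine(u, v):
    # Cosine similarity between two count vectors; 0 if either is empty.
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def punct_vector(segment):
    counts = Counter(ch for ch in segment if ch in PUNCT)
    return [counts[p] for p in PUNCT]

def lang_dif(declared, detected):
    # Mismatches between declared and detected language code pairs,
    # e.g. declared ("en", "it") vs detected ("de", "fr") -> 2.
    return sum(a != b for a, b in zip(declared, detected))

def extract_features(source, target):
    text = source + " " + target
    return {
        "same": int(source == target),
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "has_number": int(bool(re.search(r"\d", text))),
        "punctuation_similarity": cosine(punct_vector(source),
                                         punct_vector(target)),
    }

feats = extract_features("How are you?", "Come stai?")
print(feats["same"], feats["punctuation_similarity"])  # 0 1.0
```

The tag, email, URL and number similarities follow the same pattern: build a count vector per segment and compare the two vectors with cosine similarity.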
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random forests have a lower probability of overfitting the data than decision trees.
• Logistic Regression. Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines (with the linear kernel). Support vector machines are one of the most widely used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The predicted output is the majority label. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
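The six classifiers can be instantiated from scikit-learn as follows. This is a sketch: the hyper-parameters are library defaults, not the paper's settings, and the toy data stands in for the real feature matrix:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Toy stand-in for the normalized feature matrix and labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

for name, clf in classifiers.items():
    clf.fit(X, y)
    print(f"{name}: train accuracy = {clf.score(X, y):.2f}")
```

Note how the unpruned decision tree fits the training data perfectly, illustrating the overfitting tendency mentioned above.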
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified cross-validation on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest            0.95      0.97   0.96
Decision Tree            0.98      0.97   0.97
SVM                      0.94      0.98   0.96
K-Nearest Neighbors      0.94      0.98   0.96
Logistic Regression      0.92      0.98   0.95
Gaussian Naive Bayes     0.86      0.96   0.91
Baseline Uniform         0.69      0.53   0.60
Baseline Stratified      0.70      0.73   0.71
Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
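The evaluation protocol, including the two baselines, can be reproduced with scikit-learn's DummyClassifier. This is a sketch on synthetic data with roughly the same 70/30 class balance as the training set; the real features are not shown:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.3).astype(int)  # ~70% positive, like the training set

cv = StratifiedKFold(n_splits=3)
systems = {
    "SVM (linear)": SVC(kernel="linear"),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}
f1 = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
      for name, clf in systems.items()}
for name, score in f1.items():
    print(f"{name}: F1 = {score:.2f}")
```

The "uniform" strategy predicts classes at random, and the "stratified" strategy samples predictions from the training class distribution, matching the two baselines described above.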
The second evaluation is performed on the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.
The results for the second evaluation are worse than the results for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest            0.85      0.63   0.72
Decision Tree            0.82      0.69   0.75
SVM                      0.82      0.81   0.81
K-Nearest Neighbors      0.83      0.66   0.74
Logistic Regression      0.80      0.80   0.80
Gaussian Naive Bayes     0.76      0.61   0.68
Baseline Uniform         0.71      0.72   0.71
Baseline Stratified      0.70      0.51   0.59
Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where precision is increased even if recall drops. We make some suggestions in the next section.
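One way to trade recall for precision, as suggested here, is to raise the decision threshold on the classifier's probability output. This is a sketch on synthetic data; the thresholds 0.5 and 0.8 are illustrative, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(42)
X = rng.rand(500, 3)
# Noisy labels, driven mostly by the first feature.
y = ((X[:, 0] + 0.3 * rng.rand(500)) > 0.6).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]

for threshold in (0.5, 0.8):
    pred = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, pred):.2f} "
          f"recall={recall_score(y, pred):.2f}")
```

A stricter threshold keeps only high-confidence deletions, which reduces the number of good bi-segments thrown away at the cost of leaving more bad ones in the database.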
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics, University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics, University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
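A word-level Levenshtein fuzzy score makes the 25% figure concrete. This is a sketch; commercial TM tools use their own, typically proprietary, variants of this formula:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def match_score(seg_a, seg_b):
    # Fuzzy match in percent: 100 * (1 - distance / longer length).
    a, b = seg_a.split(), seg_b.split()
    return 100 * (1 - levenshtein(a, b) / max(len(a), len(b)))

input_seg = " ".join(f"w{i}" for i in range(1, 21))                          # (a)
tm_seg = " ".join(f"w{i}" for i in list(range(1, 6)) + list(range(21, 36)))  # (b)
print(match_score(input_seg, tm_seg))  # 25.0, far below a 70-75% threshold
```

Splitting (b) into its clauses would let the first five words form their own TM segment, which then matches the corresponding clause of (a) at a much higher score.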
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
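As a heavily simplified, purely illustrative stand-in for the rule-based splitter (the real system in Puscasu (2004) uses rich linguistic rules and finite-verb detection; the handful of clause-introducing function words below is our assumption):

```python
import re

# Split before a few common subordinators/coordinators. This is only a
# toy approximation of finite-clause splitting.
BOUNDARY = re.compile(r"\s+(?=(?:that|which|because|although|and|but)\b)")

def split_clauses(sentence):
    return [c.strip() for c in BOUNDARY.split(sentence) if c.strip()]

print(split_clauses(
    "the ministry of defense once indicated "
    "that about 20000 soldiers were missing in the korean war"))
# -> ['the ministry of defense once indicated',
#     'that about 20000 soldiers were missing in the korean war']
```

Each resulting clause is then stored as its own TM segment, exactly as in the "with clause splitting" condition of the experiments below.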
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the TM stores the component clauses of the original longer segment; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example
5 Results

WORDFAST                         W/o clause splitting   With clause splitting
Retrieved (Default threshold)          23.33%                  90.00%
Retrieved (Minimum threshold)          38.00%                  92.22%

TRADOS                           W/o clause splitting   With clause splitting
Retrieved (Default threshold)          14.00%                  88.89%
Retrieved (Minimum threshold)          14.00%                  96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
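The retrieval setup evaluated above can be illustrated with a small sketch. The clause splitter and the scoring function below are toy stand-ins (the paper uses Puscasu's (2004) multilingual splitter and the proprietary Wordfast/Trados matchers), so the absolute scores are only indicative:

```python
def split_clauses(segment):
    # Toy stand-in for a real clause splitter: start a new clause at the
    # markers "that" / "and that" (keeping the marker with its clause).
    clauses, current = [], []
    words = segment.split()
    for i, w in enumerate(words):
        starts_clause = (
            (w == "that" and (not current or current[-1] != "and"))
            or (w == "and" and i + 1 < len(words) and words[i + 1] == "that"))
        if starts_clause and current:
            clauses.append(" ".join(current))
            current = []
        current.append(w)
    if current:
        clauses.append(" ".join(current))
    return clauses

def fuzzy_score(a, b):
    # Word-level Levenshtein similarity; commercial TM tools use their own
    # (undisclosed) variants of this measure.
    aw, bw = a.split(), b.split()
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(bw) + 1)]
         for i in range(len(aw) + 1)]
    for i in range(1, len(aw) + 1):
        for j in range(1, len(bw) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (aw[i - 1] != bw[j - 1]))
    return 1 - d[-1][-1] / max(len(aw), len(bw))

tm_segment = ("the ministry of defense once indicated that about 20000 "
              "soldiers were missing in the korean war and that the ministry "
              "of defense believes there may still be some survivors")
input_seg = ("the ministry of defense once indicated that about 20000 "
             "soldiers were missing in the korean war")

# Without clause splitting: the input is matched against one long TM segment
score_without = fuzzy_score(input_seg, tm_segment)

# With clause splitting: both TM and input are stored clause by clause, and
# each input clause is matched against the best TM clause
tm_clauses = split_clauses(tm_segment)
scores_with = [max(fuzzy_score(c, t) for t in tm_clauses)
               for c in split_clauses(input_seg)]
print(round(score_without, 2), scores_with)  # -> 0.55 [1.0, 1.0]
```

With the whole segment, the extra clauses in the TM entry drag the score below typical retrieval thresholds; clause by clause, both input clauses find exact matches.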
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
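The test itself is straightforward to reproduce in outline. The score vectors below are hypothetical stand-ins for the per-segment match scores, not the actual experimental data (which the paper analyses in SPSS):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    # Paired t statistic over per-segment match scores; a score of 0 is
    # used when no (correct) match is retrieved, as in the paper.
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-segment scores for 10 test segments
without_split = [0, 35, 0, 0, 72, 0, 40, 0, 0, 55]
with_split = [88, 90, 75, 100, 95, 80, 92, 85, 78, 96]

t = paired_t(without_split, with_split)
print(round(t, 2))  # a |t| this large at df = 9 corresponds to p << 0.0001
```

The consistently large per-segment differences are what drive the very small p-values reported in Table 5.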
SET A

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold   19.23                85.30888891         0.000
Minimum threshold   29.28                86.48888891         0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold   11.78                83.87777781         0.000
Minimum threshold   11.78                88.07111113         0.000

SET B

WORDFAST            Mean w/o splitting   Mean w/ splitting   p-value
Default threshold   2.23                 17.21111111         0.000
Minimum threshold   6.49                 30.62111112         0.000

TRADOS              Mean w/o splitting   Mean w/ splitting   p-value
Default threshold   2.71                 23.083              0.000
Minimum threshold   16.83                46.36777779         0.000

Table 5: Paired t-test on all experiments
WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    2.67%                  17.84%
Retrieved (minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    3.33%                  25.95%
Retrieved (minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, Sept 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikit@lit.auth.gr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
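These figures can be checked with a direct implementation of word-level edit distance. The operation counts below match the ones quoted in the text; the resulting percentages differ slightly from the paper's because the paper normalizes by a 13-word count rather than by naive whitespace tokens:

```python
def word_edit_distance(a, b):
    # Levenshtein distance over whitespace-separated, lowercased tokens
    aw, bw = a.lower().split(), b.lower().split()
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(bw) + 1)]
         for i in range(len(aw) + 1)]
    for i in range(1, len(aw) + 1):
        for j in range(1, len(bw) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (aw[i - 1] != bw[j - 1]))
    return d[-1][-1]

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
s4 = "Press Cancel to cancel your booking information"

print(word_edit_distance(s1, s2))  # 4: 1 substitution + 3 deletions
print(word_edit_distance(s4, s2))  # 1: a single substitution (personal/booking)
print(word_edit_distance(s4, s1))  # 5: 2 substitutions + 3 deletions
```

Dividing each count by the segment length (10 tokens for sentence (1)) gives scores in the same neighbourhood as the 70%, 90% and 61% figures discussed in the text.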
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Apart from some exceptional cases, they appear either without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods, which follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived form and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
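The NooJ grammar itself cannot be reproduced here, but its effect can be approximated in a few lines: detect a support verb plus an optional determiner plus a predicate noun, and replace the construction with the noun's base verb. The nominalization table and the inflection handling below are hypothetical simplifications; in NooJ, the mapping comes from derivational paradigms in the dictionary, and tense and person are preserved by the grammar:

```python
import re

# Hypothetical nominalization -> full-verb mapping (a stand-in for the
# derivational paradigms of the NooJ dictionary)
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"

def paraphrase_svc(sentence):
    # support verb + optional determiner + predicate noun (+ optional "of")
    pattern = re.compile(
        rf"\b{SUPPORT_VERBS}\s+(?:the\s+|a\s+|an\s+)?"
        rf"({'|'.join(NOMINALIZATIONS)})\s*(?:of\s+)?",
        re.IGNORECASE)
    return pattern.sub(
        lambda m: NOMINALIZATIONS[m.group(1).lower()] + " ",
        sentence).strip()

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# -> Press Cancel to cancel your personal information
```

Unlike this sketch, the real module also inflects the generated verb (e.g. made a cancellation -> cancelled) via the +PR-style constraints described above.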
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that it at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.

The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%). However, we achieve high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                   14     48
95-99%                 23     32
85-94%                 18     29
75-84%                 51     38
50-74%                 32     18
No match               62     35
Total                  200    200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is two-fold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use it without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches not equal to 100%.

Despite the improvements in fuzzy matching, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for low fuzzy match categories. In addition to this, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References

Athayde, M. F. (2001). Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. (2008). Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. (2008). Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. (2003). The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. (2014). Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. (2014). Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. (2010a). Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. (2010b). Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. (1997). Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. (1980). The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. (2010). Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. (2006). Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. (1999). Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. (2003). NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. (2009). EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. (2014). Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. (2010). Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable into savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created the on-line system TransSearch (Macklovitch et al., 2000) for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
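As an illustration of how these scores could drive the selection, one can keep, for each subsegment, the candidate with the highest direct phrase translation probability. The paper does not state which combination of the four scores is used, so the ranking criterion below is an assumption:

```python
# Sketch: choosing among multiple translations of a subsegment using the
# four Moses phrase-table scores described above. Ranking by the direct
# phrase translation probability (third score) is our assumption.

def best_translation(entries):
    """entries: list of (translation, [p_inv, lex_inv, p_dir, lex_dir])."""
    return max(entries, key=lambda e: e[1][2])[0]

entries = [
    ("be greater", [0.538, 0.053, 0.538, 0.136]),  # values from Figure 1
    ("be larger",  [0.170, 0.054, 0.019, 0.148]),
]
print(best_translation(entries))  # -> be greater
```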
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
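The three properties can be read off the alignment points mechanically. A minimal sketch, where the function names and the exact definitions of the three checks are ours, inferred from the notation above:

```python
# Parse Moses-style alignment points ("0-0 1-1") and check the three
# properties mentioned in the text: empty alignment, one-to-many
# alignment, and opposite (reversed) orientation of aligned words.

def parse_alignment(s):
    return [tuple(map(int, pair.split("-"))) for pair in s.split()]

def properties(points, src_len, tgt_len):
    aligned_src = {s for s, _ in points}
    aligned_tgt = {t for _, t in points}
    # 1) some source or target word is left unaligned
    has_empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    # 2) some word participates in more than one link
    one_to_many = len(points) > len(aligned_src) or len(points) > len(aligned_tgt)
    # 3) some pair of links crosses, i.e. word order is reversed
    reversed_order = any(
        s1 < s2 and t1 > t2
        for i, (s1, t1) in enumerate(points)
        for (s2, t2) in points[i + 1:]
    )
    return has_empty, one_to_many, reversed_order

pts = parse_alignment("0-0 1-1")   # "být větší" -> "be greater"
print(properties(pts, 2, 2))       # -> (False, False, False)
```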
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.
1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:      • "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained by a combined n-gram score (for n = 1 to 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
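A combined n-gram fluency score of this kind can be sketched as follows. The toy add-one model below merely stands in for the real language model (KenLM in the authors' experiments), and averaging the n-gram log-probabilities is our assumption about how the per-order scores are combined:

```python
# Sketch: a combined n-gram fluency score for a candidate segment,
# averaging log-probabilities of all n-grams for n = 1..5. The toy
# add-one-smoothed count model is a stand-in for a real LM.

import math

def logprob(ngram, counts, total):
    # toy maximum-likelihood estimate with add-one smoothing
    return math.log((counts.get(ngram, 0) + 1) / (total + 1))

def fluency(tokens, counts, total, max_n=5):
    scores = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            scores.append(logprob(tuple(tokens[i:i + n]), counts, total))
    return sum(scores) / len(scores) if scores else float("-inf")

counts = {("a", "table"): 5, ("table",): 9, ("a",): 20}
print(fluency(["a", "table"], counts, total=100))
```

A fluent combination ("a table") thus scores higher than the same words in an implausible order ("table a"), which is what the method uses to rank candidate joins.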
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
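Under our reading of this description, the JOIN procedure can be sketched over (start, end) token spans. The data layout and the greedy longest-first loop below are our reconstruction, not the authors' code:

```python
# Sketch of the JOIN idea: subsegments are (start, end) token spans in
# the input segment; two spans can be joined when they overlap or are
# direct neighbours, and longer spans are preferred (tried first).

def joinable(a, b):
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1            # overlap or direct neighbour

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def join_pass(spans):
    I = sorted(spans, key=lambda sp: sp[1] - sp[0], reverse=True)
    changed = True
    while changed and len(I) > 1:
        changed = False
        for i, head in enumerate(I):
            partners = [sp for j, sp in enumerate(I)
                        if j != i and joinable(head, sp)]
            if partners:
                best = max(partners, key=lambda sp: sp[1] - sp[0])
                I.remove(best)                 # discard the consumed span
                I[I.index(head)] = join(head, best)
                I.sort(key=lambda sp: sp[1] - sp[0], reverse=True)
                changed = True
                break
    return I

print(join_pass([(0, 2), (3, 5), (5, 7)]))  # -> [(0, 7)]
```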
2 In current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
feature   SUB    OJ     NJ     NS
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 of MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.
Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.
Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.
Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.
Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.
Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.
Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.
Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.
Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.
Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3,
Mihaela Vela2, Josef van Genabith2,3
1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
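A TER-style retrieval step of this kind can be sketched as follows. Real TER additionally allows block shifts, so the plain word-level Levenshtein rate below is a simplification, not the exact metric used by the tool:

```python
# Sketch: rank TM segments by a TER-like edit rate (word-level
# Levenshtein distance over the reference length, covering the
# insert/delete/substitute part of TER) and return the top k matches.

def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits to turn h[:i] into r[:j]
    dp = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(r) + 1)]
          for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                            # delete
                           dp[i][j - 1] + 1,                            # insert
                           dp[i - 1][j - 1] + (h[i - 1] != r[j - 1]))   # substitute
    return dp[len(h)][len(r)] / max(len(r), 1)

def top_matches(test, tm_segments, k=5):
    return sorted(tm_segments, key=lambda seg: edit_rate(test, seg))[:k]

tm = ["we want to have a table near the window",
      "how much does it cost",
      "i would like a cup of coffee"]
print(top_matches("we would like a table by the window", tm, k=1))
# -> ['we want to have a table near the window']
```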
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), ("","",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
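From such an alignment, the tokens of the TM match split mechanically into matched and to-be-edited parts. A minimal sketch, where the list-of-operations input format is our simplification of the tercom output:

```python
# Sketch: walk the TER edit operations and collect which TM-match tokens
# survive unchanged (candidates for green) and which are edited (red).

def split_by_ops(tm_tokens, ops):
    """ops: one of C, I, D, S per alignment step, in TM-token order;
    I consumes no TM token, all other operations consume one."""
    matched, edited = [], []
    i = 0
    for op in ops:
        if op == "I":          # insertion: input-only word, no TM token
            continue
        (matched if op == "C" else edited).append(tm_tokens[i])
        i += 1
    return matched, edited

tm = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]  # from the example above
print(split_by_ops(tm, ops))
# -> (['we', 'a', 'table', 'the', 'window'], ['want', 'to', 'have', 'near'])
```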
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below:
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
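The effect of down-weighting deletions can be sketched numerically; the weight values below are illustrative assumptions, not the ones used in the tool:

```python
# Sketch: score TER-style edit operations with a much cheaper deletion,
# so that a longer TM match needing only deletions beats a shorter match
# needing insertions. Weight values are assumed for illustration.

WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(ops):
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
seg1_ops = ["C"] * 5 + ["D"] * 6   # TM segment 1: delete 6 extra words
seg2_ops = ["C"] * 2 + ["I"] * 3   # TM segment 2: insert 3 missing words
print(round(weighted_cost(seg1_ops), 1), weighted_cost(seg2_ops))  # -> 0.6 3.0
```

Under equal weights segment 2 would win (3 edits vs. 6), but with the cheap deletion segment 1 is preferred, matching the argument above.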
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched parts and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM:
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red in the selected source sentences and their corresponding translations.
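A minimal sketch of how the two alignments might be combined into the color coding (illustrative only: the function, its inputs, and the "all aligned source tokens matched" rule are our assumptions, not necessarily how CATaLog implements it):

```python
# Hypothetical sketch of the color-coding step: TM-source tokens matched by
# the TER alignment are colored green, the rest red, and the colors are
# projected onto the TM target through the GIZA++ word alignment.

def color_segments(tm_source, tm_target, matched_src, alignment):
    """matched_src: indices of TM-source tokens matched by the TER alignment.
    alignment: set of (src_idx, tgt_idx) pairs, e.g. from GIZA++."""
    src_tokens = tm_source.split()
    tgt_tokens = tm_target.split()
    src_colors = ["green" if i in matched_src else "red"
                  for i in range(len(src_tokens))]
    # A target token is green only if every source token aligned to it matched.
    tgt_colors = []
    for j in range(len(tgt_tokens)):
        linked = [i for (i, jj) in alignment if jj == j]
        ok = bool(linked) and all(i in matched_src for i in linked)
        tgt_colors.append("green" if ok else "red")
    return list(zip(src_tokens, src_colors)), list(zip(tgt_tokens, tgt_colors))
```

Unaligned target tokens default to red here; that tie-breaking choice is an assumption of the sketch.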
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number
3 you are wrong
4 you pay me
5 you 're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
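The posting-list filter described above can be sketched as follows (an illustrative inverted index with an invented stop-word list; not the tool's actual code):

```python
# Sketch: an inverted index from vocabulary words to the TM sentences that
# contain them, used to restrict the candidate set before TER scoring.
from collections import defaultdict

STOP = {"the", "a", "an", "to", "is", "i", "you"}   # hypothetical stop list

def build_index(tm_sources):
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:            # no stemming: surface forms only
                index[w].add(sid)
    return index

def candidates(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input."""
    ids = set()
    for w in input_sentence.lower().split():
        ids |= index.get(w, set())
    return ids

tm = ["you gave me the wrong change", "i think you 've got the wrong number",
      "we want to have a table near the window"]
idx = build_index(tm)
# Only the first two sentences share content words with this input.
cand = candidates(idx, "you gave me wrong number")
```

Only the candidates returned by the filter need the expensive TER alignment, which is where the speed-up comes from.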
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, the popular keystroke logging among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10 TM11
Not started  234  204  150  202  143  143   38   27   26   26   26
Pre-trans    155  111  178  108  179  183  139  199  205  201  205
Fragments     36  110   97  115  103   99  248  199  194  198  194
Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started
           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10 TM11
100%         0    5    5    5    5    5    5    5    5    5    5
95–99%       2    3    3    4    4    4    9    9    9    9    9
85–94%       0    1    1    1    1    1    2    2    2    2    2
75–84%      14    9   14    9   15   15   12   16   16   16   16
50–74%     187  130  198  136  202  202  180  223  226  226  226
No match   142  197  124  190  118  118  137   90   87   87   87
Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band
erated from the client's TM (TM9) or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).
If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.
In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.
6 Conclusion and future work
In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.
We have also envisioned the combination of already existing methods for retrieving a higher
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orăsan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, Sept 2015
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations using the first method is high and the errors are relatively few.
¹https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench². Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumption behind these validations seems to be that the translation memory bi-segments contain accidental errors (e.g., tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.
²http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length³. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / √(3.4 (ls + ld))        (1)
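Equation (1) translates directly into code; here segment length is taken as the character count, which the text does not specify, so treat the unit as an assumption:

```python
# Modified Church-Gale length difference of equation (1), after
# Tiedemann (2011). Lengths are character counts (an assumption).
import math

def church_gale(source, target):
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

Segments of similar length score near 0, while a large positive or negative CG flags a suspicious length mismatch between source and target.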
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat⁴. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
³This simple idea is implemented in many sentence aligners.
⁴Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set
Figure 3: The distribution of Church-Gale scores in the Collaborative Set
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: it contains 1243 bi-segments, 373 of which are negative examples.

• Test Set: it contains 309 bi-segments, 87 of which are negative examples.
The proportion of negative examples in both sets is approximately 30%.
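The biased sampling described above can be sketched as follows (illustrative only: the weighting function, the small probability floor, and character-count lengths are our assumptions, not the paper's exact procedure):

```python
# Sketch: draw bi-segments from the Collaborative Set with probability
# growing with the distance of their Church-Gale score from the centre
# of the distribution, so that likely errors are over-represented.
import math
import random

def church_gale(src, tgt):
    # Equation (1); lengths as character counts (an assumption).
    return (len(src) - len(tgt)) / math.sqrt(3.4 * (len(src) + len(tgt)))

def biased_sample(bisegments, k, seed=0):
    """Sample k bi-segments, weighting each by |CG score| plus a small
    floor (0.01, an invented constant) so no pair has zero probability."""
    rng = random.Random(seed)
    weights = [abs(church_gale(s, t)) + 0.01 for s, t in bisegments]
    return rng.choices(bisegments, weights=weights, k=k)
```

A pair with a large length mismatch gets a proportionally larger weight, so the manual validation step then has a good chance of finding negative examples among the sampled pairs.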
Figure 5: The Q-Q plot for the Matecat Set
Figure 6: The Q-Q plot for the Collaborative Set
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists or does not exist, and if it exists, it is present or not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment_similarity: The value of the feature is the cosine similarity between the target segment and the machine translation of the source segment into the target language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
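The similarity features above share one computation: a cosine similarity between count vectors of punctuation marks, tags, emails, URLs or numbers extracted from each segment. The paper does not give the implementation, so the following pure-Python sketch is illustrative; the set of punctuation marks and the convention that two empty vectors count as identical are our assumptions:

```python
from collections import Counter
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(vec_a[k] * vec_b.get(k, 0) for k in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: two segments with no punctuation at all are "similar".
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)

def punctuation_vector(segment, marks=".,;:!?"):
    """Count occurrences of each punctuation mark in a segment."""
    return Counter(ch for ch in segment if ch in marks)

src = "Hello, world! How are you?"
tgt = "Ciao, mondo! Come stai?"
sim = cosine_similarity(punctuation_vector(src), punctuation_vector(tgt))
# Identical punctuation profiles give a similarity of 1.0.
```

The same `cosine_similarity` helper would apply unchanged to tag, email, URL and number count vectors.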
All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support vector machines are one of the most widely used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The predicted label is the one shared by the majority of these neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
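The two baselines can be reproduced in a few lines; the sketch below mirrors the behaviour of scikit-learn's DummyClassifier with the "uniform" and "stratified" strategies, using an invented 70/30 training distribution for illustration:

```python
import random
from collections import Counter

def baseline_uniform(train_labels, n_test, rng):
    """Predict labels uniformly at random, ignoring the training distribution."""
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n_test)]

def baseline_stratified(train_labels, n_test, rng):
    """Predict labels at random, respecting the training class distribution."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    classes = sorted(counts)
    weights = [counts[c] / total for c in classes]
    return rng.choices(classes, weights=weights, k=n_test)

rng = random.Random(0)
train = [1] * 70 + [0] * 30   # hypothetical imbalanced training set
uni = baseline_uniform(train, 1000, rng)     # ~50% positives
strat = baseline_stratified(train, 1000, rng)  # ~70% positives
```

On an imbalanced set, the stratified baseline is the harder one to beat, which is why both are reported.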
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods that increase precision, even if recall drops. We make some suggestions in the next section.
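The "around 10%" figure follows directly from the reported confusion counts; a quick arithmetic check (the precision and recall derived this way can differ slightly from the rounded values in Table 2):

```python
# Confusion counts reported for the SVM on the 309-example test set
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)            # 175/217 ≈ 0.806
recall = tp / (tp + fn)               # 175/207 ≈ 0.845
f1 = 2 * precision * recall / (precision + recall)

# Fraction of all examples that are false negatives, i.e. good
# bi-segments that the cleaning step would wrongly throw away.
discarded = fn / (tp + fp + fn + tn)  # 32/309 ≈ 0.104, "around 10%"
```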
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar: the distribution is closer to normal for the MateCat set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples, and in this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context, and to actually be used by the translator, than phrase matches, for example.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas in order to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then, a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
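The effect can be made concrete with the usual Levenshtein-based fuzzy match score. The sketch below uses the common word-level formulation (commercial tools use proprietary variants of this score) on the w1…w20 example above:

```python
def levenshtein(a, b):
    """Word-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_score(src, tm_seg):
    """A common TM-style score: 1 - edit_distance / longer segment length."""
    src, tm_seg = src.split(), tm_seg.split()
    return 1 - levenshtein(src, tm_seg) / max(len(src), len(tm_seg))

# Input (a) shares only its first five words with TM segment (b)
a = " ".join(f"w{i}" for i in range(1, 21))                            # w1 ... w20
b = " ".join(f"w{i}" for i in [1, 2, 3, 4, 5] + list(range(21, 36)))   # w1..w5 w21..w35

whole = fuzzy_score(a, b)                        # 0.25: far below any usable threshold
split = fuzzy_score("w1 w2 w3 w4 w5", "w1 w2 w3 w4 w5")  # 1.0 once both sides are clause-split
```

With whole sentences the score stays at 25%, below even the minimum thresholds discussed later; after clause splitting the shared clause matches exactly.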
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; cited in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, we argue that in this study it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    23.33%                     90.00%
Retrieved (minimum threshold)    38.00%                     92.22%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    14.00%                     88.89%
Retrieved (minimum threshold)    14.00%                     96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    2.67%                      17.84%
Retrieved (minimum threshold)    10.00%                     41.08%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    3.33%                      25.95%
Retrieved (minimum threshold)    36.00%                     70.67%

Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat it as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST             Mean without clause splitting   Mean with clause splitting   p-value
Default threshold    19.23                           85.31                        0.000
Minimum threshold    29.28                           86.49                        0.000

TRADOS               Mean without clause splitting   Mean with clause splitting   p-value
Default threshold    11.78                           83.88                        0.000
Minimum threshold    11.78                           88.07                        0.000

SET B

WORDFAST             Mean without clause splitting   Mean with clause splitting   p-value
Default threshold    2.23                            17.21                        0.000
Minimum threshold    6.49                            30.62                        0.000

TRADOS               Mean without clause splitting   Mean with clause splitting   p-value
Default threshold    2.71                            23.08                        0.000
Minimum threshold    16.83                           46.37                        0.000

Table 5: Paired t-test on all experiments
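The paper runs the test in SPSS, but the paired t-statistic itself is straightforward to reproduce. In this sketch the match-score lists are invented toy values (0 standing for "no correct match retrieved", as in the paper's setup), not the paper's data:

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical per-segment match scores for one experiment
without_cs = [0, 0, 45, 0, 30, 0, 0, 50, 0, 0]
with_cs = [80, 75, 95, 70, 85, 0, 90, 100, 65, 88]
t, df = paired_t(without_cs, with_cs)
# A strongly negative t (here about -7.5 with 9 degrees of freedom)
# indicates the scores are systematically higher with clause splitting.
```

The p-value would then come from the t-distribution with `df` degrees of freedom (e.g. via `scipy.stats.ttest_rel` on the two lists).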
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which prevents the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by matching the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). Conceptually, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, the translation quality is more consistent and the workflow more efficient.
The fuzzy match level refers to all the corrections a professional translator must make for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the complexity of natural language, the fuzzy match level for similar but not identical sentences is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 presents the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output; they do not improve the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis: their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM, so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make two sentences identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
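The word-level variant of this score can be sketched in a few lines of Python. This is an illustrative reading of the edit-distance-based formula in Koehn (2010), with the distance normalized by the longer segment's length; it is not the undisclosed algorithm of any commercial CAT tool:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # i deletions
    for j in range(n + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match(segment, tm_segment):
    """Word-based fuzzy match score in [0, 1]."""
    s, t = segment.split(), tm_segment.split()
    return 1 - edit_distance(s, t) / max(len(s), len(t))
```

With sentences (1) and (2), the distance is 4 word edits (1 substitution plus 3 deletions); the resulting percentage depends directly on how the segment length is counted, which is why different tools report slightly different scores for the same pair.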
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC (with the nominalization cancellation), while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a determiner (e.g., to make attention), in others with one (e.g., to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g., V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
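For readers unfamiliar with the notation, entries of this shape can be decomposed mechanically. The sketch below parses the comma- and plus-separated lemma/category/property format shown in Figure 1; the function name and the returned dictionary layout are our own illustration, not part of the NooJ toolkit:

```python
def parse_nooj_entry(line):
    """Split an entry such as 'artist,N+FLX=TABLE+Hum' into its parts."""
    lemma, rest = line.split(",", 1)
    category, *features = rest.split("+")
    properties = {}
    for feature in features:
        key, _, value = feature.partition("=")
        # flag-style properties (e.g. +Hum) carry no value
        properties[key] = value if value else True
    return {"lemma": lemma, "category": category, "properties": properties}
```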
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected, so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006); the same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed (optionally) by a determiner, adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to the verb it derives from, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline comprises the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc., as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs with English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
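The overall effect of the three pipelines can be summarized in a short sketch: paraphrased source units are appended alongside the originals, while every target unit is copied verbatim. The `paraphrase` callback below is a simplified stand-in for the NooJ step (a real run would apply the SVC grammars described above):

```python
def expand_tm(units, paraphrase):
    """Return the TM extended with paraphrased source units.

    `units` is a list of (source, target) pairs; `paraphrase` is any
    function rewriting SVCs in a source segment.  Targets are copied
    untouched, mirroring the guarantee that the target side of the TM
    remains "as is".
    """
    expanded = list(units)
    for source, target in units:
        new_source = paraphrase(source)
        if new_source != source:          # add only genuinely new units
            expanded.append((new_source, target))
    return expanded
```

For instance, a toy rule mapping "make the cancellation of" to "cancel" would turn a 1-unit TM into a 2-unit one, with both source variants pointing at the same Italian target.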
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before, in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.
In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain, to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). The test set was selected manually, in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, with a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category and, finally, 27% in the 0%-74% category (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and, hence, the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.
The strategy implemented has already been evaluated in different experiments, with different translators and text types, and proved to be very effective. The results show that, for all fuzzy-match ranges, our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.
In the future, we will continue to explore paraphrasing of other support verb constructions, as well as support for other languages. Last but not least, a paraphrasing framework applied to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.
Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.
Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.
Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.
He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.
He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.
Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop on Modern Approaches in Translation Technologies.
Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.
Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.
Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.
Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).
Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.
Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.
Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.
Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources, of sizes varying from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.
Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
The evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ: být větší | EN: be greater | probabilities: 0.538 0.053 0.538 0.136 | alignment: 0-0 1-1
CZ: být větší | EN: be larger  | probabilities: 0.170 0.054 0.019 0.148 | alignment: 0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).
In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
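A record like the one in Figure 1 can be decoded in a few lines of Python. We assume here the Moses-style '|||'-separated layout for the fields (source, target, four probabilities, alignment points); the exact on-disk format used by the authors is not specified in the paper:

```python
def parse_subsegment(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| 0-0 1-1' into a record."""
    source, target, probs, alignment = [f.strip() for f in line.split("|||")]
    return {
        "source": source.split(),
        "target": target.split(),
        # inverse phrase prob., inverse lexical weight,
        # direct phrase prob., direct lexical weight
        "probs": [float(p) for p in probs.split()],
        # word alignment pairs, e.g. "0-0" -> (0, 0)
        "alignment": [tuple(map(int, p.split("-"))) for p in alignment.split()],
    }
```

Selecting the best of several alternative translations for a subsegment then amounts to, for example, keeping the record with the highest direct phrase translation probability (the third probability field).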
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS.
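The overlapping substitution illustrated in Table 1 can be sketched with a toy splice on the source side. For simplicity, this sketch assumes the replaced gap is exactly one word before the overlapping suffix; this is our simplification, not the paper's exact procedure:

```python
def substitute_overlap(seg1, seg2):
    """Toy SUBSTITUTE-O: find a suffix of seg2 that also occurs inside
    seg1, and replace the single word before that overlap in seg1 with
    the non-overlapping prefix of seg2 (a simplifying assumption)."""
    w1, w2 = seg1.split(), seg2.split()
    # try the longest overlapping suffix of seg2 first
    for k in range(len(w2) - 1, 0, -1):
        suffix = w2[-k:]
        for i in range(1, len(w1) - k + 1):
            if w1[i:i + k] == suffix:
                # keep seg1's prefix, splice in seg2's new material
                return " ".join(w1[:i - 1] + w2[:-k] + suffix + w1[i + k:])
    return None
```

Applied to the English glosses of Table 1, `substitute_overlap("can be divided into these categories", "the following categories")` returns "can be divided into the following categories".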
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and the combinations are ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model². The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
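As a toy stand-in for the language model, a combined n-gram score for n = 1..5 can be sketched with relative-frequency estimates and add-one smoothing. Both choices are simplifying assumptions; the actual experiments use a trained KenLM model:

```python
import math
from collections import Counter

def ngram_counts(corpus_tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(corpus_tokens[i:i + n])
                   for i in range(len(corpus_tokens) - n + 1))

def fluency_score(candidate, corpus_tokens, max_n=5):
    """Toy combined n-gram fluency score: average add-one-smoothed
    log-probability of the candidate's n-grams for n = 1..max_n."""
    tokens = candidate.split()
    score = 0.0
    for n in range(1, max_n + 1):
        counts = ngram_counts(corpus_tokens, n)
        total = sum(counts.values()) or 1
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if not grams:
            continue
        score += sum(math.log((counts[g] + 1) / (total + len(counts) + 1))
                     for g in grams) / len(grams)
    return score
```

A more fluent candidate (one whose n-grams occur in the target-language corpus) receives a higher score and is therefore preferred when ordering combinations.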
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments have been processed, the temporary list T of new subsegments is prepended to I, and the algorithm continues with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers joining longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
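The greedy procedure described above can be sketched over (start, end) token spans. The exact bookkeeping of the paper's implementation is not published, so the position arithmetic below (covering only the neighbouring JOIN-N case) is an illustrative assumption:

```python
def join_subsegments(spans):
    """Greedy JOIN sketch (JOIN-N variant): spans are (start, end) token
    positions of subsegments within one tokenized segment. The list is
    ordered longest-first; newly joined spans are prepended so that
    longer subsegments are preferred, as in the paper's description."""
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    results = []
    while I:
        cur = I.pop(0)
        T = []
        for other in I:
            joined = None
            if cur[1] == other[0]:      # cur directly precedes other
                joined = (cur[0], other[1])
            elif other[1] == cur[0]:    # other directly precedes cur
                joined = (other[0], cur[1])
            if joined and joined not in seen:
                seen.add(joined)
                T.append(joined)
        results.extend(T)
        I = T + I   # prepend the new, longer subsegments; cur is discarded
    return results
```

For instance, subsegments covering tokens (0, 2), (2, 3), and (3, 5) yield the joined spans (0, 3), (2, 5), and finally the full (0, 5).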
²In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for assessing the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the input document can be translated using a segment from the respective TM; translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95-99%   0.84    0.91    0.64    0.90    1.37
85-94%   0.07    0.05    0.25    0.76    0.81
75-84%   0.80    0.91    1.71    3.78    4.40
50-74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95-99%   0.75    0.44    0.49    0.66
85-94%   0.05    0.08    0.49    0.61
75-84%   0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95-99%   0.98    0.59    1.24    1.45
85-94%   0.09    0.22    1.34    1.37
75-84%   1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system³. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of the translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just the full (100%) matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

³http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which produce non-fluent output texts in the target language.
We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³
¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators employ on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool, and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog², which is language pair independent and allows users to upload their own memories in
¹www.matecat.com
²The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
³The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators see instantly how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. The allowable edit operations are insertion, deletion, substitution, and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviour on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM match: we want to have a table near the window .

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we want to have a table near the window .
|  D    S    S    |   |    S   |    |   |
we -    would like a table by the window .
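The alignment shown above can be reproduced, minus TER's shift operation, with a standard Levenshtein backtrace. The following is a sketch, not the tercom implementation:

```python
def edit_alignment(hyp, ref):
    """Levenshtein alignment between two token lists, returning
    (op, hyp_token, ref_token) triples with ops C/S/I/D.
    A simplification of TER: the shift operation is omitted."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i-1] == ref[j-1] else 1
            dp[i][j] = min(dp[i-1][j-1] + cost,  # match / substitute
                           dp[i-1][j] + 1,       # delete from hyp
                           dp[i][j-1] + 1)       # insert into hyp
    # backtrace from the bottom-right corner
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (hyp[i-1] != ref[j-1]):
            ops.append(("C" if hyp[i-1] == ref[j-1] else "S", hyp[i-1], ref[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            ops.append(("D", hyp[i-1], ""))
            i -= 1
        else:
            ops.append(("I", "", ref[j-1]))
            j -= 1
    return ops[::-1]
```

For the example above, the backtrace yields five matches, three substitutions, and one deletion, matching the displayed C/D/S tags.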
Since we want to rank the reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system, we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for the 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for the 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
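The effect of the weighting can be sketched as follows. The concrete weight values are our assumption; the paper only states that deletion gets a very low weight and the other operations equal weights:

```python
# Assumed edit weights: deletion is far cheaper than the other operations.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0, "H": 1.0}  # H = shift

def plain_edits(ops):
    """Unweighted edit count, as in standard TER."""
    return sum(op != "C" for op in ops)

def postedit_cost(ops):
    """Weighted cost of a TER-style op string such as "CDSSCCSCC"."""
    return sum(WEIGHTS[op] for op in ops)

# The example from the text: matching the test segment against
# TM segment 1 needs 6 deletions, against TM segment 2, 3 insertions.
seg1 = "CCCCC" + "D" * 6
seg2 = "CC" + "III"
```

Plain TER prefers TM segment 2 (3 edits against 6), while the weighted cost prefers TM segment 1 (0.6 against 3.0), matching the argument above.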
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignments between the TM source sentences and the target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
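Combining the two alignment sets can be sketched as a projection step. The data structures here (a dict from source index to target indices, 0-based) are illustrative assumptions about the internal representation rather than the tool's actual format:

```python
def project_matches(matched_src_positions, src2tgt):
    """Project source-side matched positions (from the TER alignment)
    onto the target sentence via the word alignment (a GIZA++-style
    dict mapping source index -> list of target indices). Returns the
    set of target positions to colour green; all others are red."""
    green = set()
    for s in matched_src_positions:
        green.update(src2tgt.get(s, []))
    return green

def colour_tokens(tokens, green):
    """Attach a colour label to each target token for display."""
    return [(tok, "green" if i in green else "red")
            for i, tok in enumerate(tokens)]
```

A target word is coloured green only if it is aligned to a source word that TER marked as a match, which is exactly the information the post-editor needs to see which fragments can stay untouched.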
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
39
5 you 're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss apni amake vul khuchro diyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) would take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
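The posting-list filtering described above can be sketched with a simple inverted index; the stop-word list here is illustrative:

```python
from collections import defaultdict

STOP = {"the", "a", "an", "to", "of"}   # illustrative stop-word list

def build_postings(tm_sentences):
    """Build an inverted index: lowercased surface word -> set of
    TM sentence ids containing it (no stemming, as in the paper)."""
    postings = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP:
                postings[word].add(sid)
    return postings

def candidates(postings, input_sentence):
    """TM sentences sharing at least one vocabulary word with the
    input; only these are then scored with TER, shrinking the
    search space."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids
```

The index is built once, offline, and each query then touches only the posting lists of the words that actually occur in the input sentence.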
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize the system's performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodoroou, Konstantinos, 24
Horak, Ales, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).
Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.
Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).
Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.
Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.
Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.
Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations using the first method is high and the errors are relatively few.
1 https://mymemory.translated.net
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan the further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.
It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
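This length-difference score can be sketched directly from equation 1. The sketch below assumes that l_s and l_d are character lengths of the source and destination segments, which the paper does not state explicitly:

```python
import math

def church_gale(ls: int, ld: int) -> float:
    """Modified Church-Gale length-difference score (equation 1).

    ls and ld are the lengths of the source and destination segments
    (assumed here to be character counts).
    """
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# True translations tend to have similar lengths, so scores cluster near 0;
# a large absolute value flags a suspicious length mismatch.
print(church_gale(50, 50))               # 0.0
print(round(church_gale(100, 104), 2))   # -0.15
print(church_gale(20, 2000) < -10)       # True: strongly negative mismatch
```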
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set
Figure 3: The distribution of Church-Gale scores in the Collaborative Set
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plot in figures 5 and 6.
In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope that we draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: It contains 1243 bi-segments and has 373 negative examples.
• Test Set: It contains 309 bi-segments and has 87 negative examples.
The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat Set
Figure 6: The Q-Q plot for the Collaborative Set
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect that the classifiers learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.
• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.
• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.
• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least a capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that source and target segments should have similar punctuation vectors if the source segment and the target segment are true translations.
• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and the target segments).
• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference of the numbers of words containing at least a capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference of the numbers of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr) then the value is 2.
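The lang dif feature counts the mismatches between declared and detected codes, as the examples in the bullet above suggest; a minimal sketch (function and variable names are ours, not the paper's):

```python
def lang_dif(declared, detected):
    """Count mismatches between declared and detected (source, target) codes."""
    return sum(1 for dec, det in zip(declared, detected) if dec != det)

# The examples from the text, with expected codes ("en", "it"):
print(lang_dif(("en", "it"), ("en", "it")))  # 0  (en-en/it-it)
print(lang_dif(("en", "it"), ("en", "fr")))  # 1  (en-en/it-fr)
print(lang_dif(("en", "it"), ("de", "fr")))  # 2  (en-de/it-fr)
```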
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules inferred are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than Decision Trees.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform and generates predictions randomly. The second baseline is called Baseline Stratified and generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above and the results are in table 2.
The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best-performing algorithm, SVM, and the stratified baseline is 10 points, twice lower than the difference between the best-performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
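The "around 10%" figure follows directly from the confusion counts reported for the SVM; a quick check (with "positive" meaning a true translation):

```python
# SVM confusion counts on the 309-example test set, as reported in the text.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn  # 309

precision = tp / (tp + fp)  # fraction of kept segments that are good
recall = tp / (tp + fn)     # fraction of good segments that are kept

# The false negatives are good bi-segments that would be thrown away:
print(round(fn / total, 2))  # 0.1 -> roughly 10% of the data wrongly discarded
print(round(precision, 3), round(recall, 3))
```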
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best-performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example
5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   23.33%                 90.00%
Retrieved (minimum threshold)   38.00%                 92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   14.00%                 88.89%
Retrieved (minimum threshold)   14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
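The evaluation procedure above (zero scores for unretrieved segments, clause scores averaged into one case per original input, then a paired t-test) can be sketched as follows. The scores below are invented placeholders, not the paper's data, and SciPy stands in for the SPSS computation reported in the text.

```python
# Sketch of the paired t-test setup described above, with invented scores:
# 0 for unretrieved segments, clause scores averaged into one paired case
# per original input segment.
from statistics import mean
from scipy import stats

# Hypothetical match scores per original input segment.
# Without clause splitting: one score per segment (0 = not retrieved).
without_split = [0, 0, 78, 0, 82, 0, 0, 75, 0, 0]

# With clause splitting: a list of clause scores per segment,
# averaged so each original input remains a single paired case.
with_split_clauses = [[100, 88], [95], [100], [0, 70], [100],
                      [92, 85], [100], [100], [0], [90, 100]]
with_split = [mean(scores) for scores in with_split_clauses]

t_stat, p_value = stats.ttest_rel(without_split, with_split)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Averaging keeps the two samples the same length, which the paired test requires.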
SET A

WORDFAST            Mean (without)   Mean (with)    p-value
Default threshold   19.23            85.30888891    0.000
Minimum threshold   29.28            86.48888891    0.000

TRADOS              Mean (without)   Mean (with)    p-value
Default threshold   11.78            83.87777781    0.000
Minimum threshold   11.78            88.07111113    0.000

SET B

WORDFAST            Mean (without)   Mean (with)    p-value
Default threshold   2.23             17.21111111    0.000
Minimum threshold   6.49             30.62111112    0.000

TRADOS              Mean (without)   Mean (with)    p-value
Default threshold   2.71             23.083         0.000
Minimum threshold   16.83            46.36777779    0.000

Table 5: Paired t-test on all experiments
6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   2.67%                  17.84%
Retrieved (minimum threshold)   10.00%                 41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   3.33%                  25.95%
Retrieved (minimum threshold)   36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing as well the match percent in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning
with Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human centralized evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on the improvement of the raw MT output and not on the improvement of fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemma and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
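The expansion idea described above can be sketched as follows, using sentences (1)-(3) from section 3. The paper derives SVC paraphrases with NooJ grammars; here a tiny hand-written paraphrase table stands in for that step, and `SequenceMatcher.ratio` stands in for a fuzzy match score. Each paraphrased source is added as a new TM entry pointing at the unchanged target.

```python
# Sketch: expand the SOURCE side of a TM with SVC paraphrases, leaving the
# target side untouched, so paraphrased inputs get a higher fuzzy match.
from difflib import SequenceMatcher

# Hypothetical SVC -> full-verb paraphrase pairs (the paper uses NooJ grammars).
SVC_PARAPHRASES = {
    "make the cancellation of": "cancel",
    "make a reservation for": "reserve",
}

def expand_tm(tm):
    """Return the TM plus paraphrased copies of each source, same target."""
    expanded = list(tm)
    for source, target in tm:
        for svc, verb in SVC_PARAPHRASES.items():
            if svc in source:
                expanded.append((source.replace(svc, verb), target))
    return expanded

def best_match(segment, tm):
    """Retrieve the TM entry whose source is most similar to the input."""
    return max(tm, key=lambda e: SequenceMatcher(None, segment, e[0]).ratio())

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]

query = "Press Cancel to cancel your personal information"
src, tgt = best_match(query, expand_tm(tm))
score = SequenceMatcher(None, query, src).ratio()
print(f"matched source: {src}")
print(f"target shown to translator: {tgt}")
print(f"similarity: {score:.2f}")
```

Because only the source side is expanded, the translator always sees the original human translation as the suggestion, now retrieved as an exact match.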
3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
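The edit-distance-based fuzzy match can be sketched generically as follows, normalizing the Levenshtein distance by the length of the longer segment. This is one common formulation, not necessarily the exact formula used above, and the token and character counts for sentences (1) and (2) as printed here differ slightly from the figures quoted in the text, so the resulting percentages differ too.

```python
# Sketch of a Levenshtein-based fuzzy match:
# score = 1 - edit_distance / length of the longer segment.
# Sentences are examples (1) and (2) from the text.

def levenshtein(a, b):
    """Edit distance (deletions, insertions, substitutions) between sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"

word_score = fuzzy_match(s1.split(), s2.split())
char_score = fuzzy_match(s1.replace(" ", ""), s2.replace(" ", ""))
print(f"word-based fuzzy match: {word_score:.0%}")
print(f"char-based fuzzy match: {char_score:.0%}")
```

On these strings the word-level distance is 4 (1 substitution, 3 deletions), illustrating why a segment that shares the intent of the input can still fall below usable thresholds.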
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. Although support verbs are similar to auxiliary verbs with regard to their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a direct object (e.g. to make attention) and in others with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
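The entry format lends itself to straightforward parsing. The sketch below is a hypothetical reader we wrote for illustration, not NooJ's actual loader: it reads entries of the shape lemma,CATEGORY followed by +PROP or +PROP=VALUE fields into a dictionary, and it ignores NooJ features such as multiple translations or paradigm lookup.

```python
# A hypothetical reader for NooJ-style dictionary lines such as
# "artist,N+FLX=TABLE+Hum": a lemma, then the category, then
# "+"-separated properties, some carrying "=" values. This sketches
# the entry format only; it is not NooJ's actual loader.

def parse_entry(line):
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    entry = {"lemma": lemma, "category": fields[0], "properties": {}}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        # flag-style properties (e.g. +Hum) are stored as True
        entry["properties"][key] = value if value else True
    return entry

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# {'lemma': 'artist', 'category': 'N', 'properties': {'FLX': 'TABLE', 'Hum': True}}
```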
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process; tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
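The effect of such a grammar can be illustrated with a toy rewrite rule. The nominalization table, the support-verb list and the regular expression below are hand-built stand-ins for NooJ's dictionaries and local grammars, which additionally handle adjectives, adverbs, other tenses, and person/tense agreement of the generated verb.

```python
# A toy illustration of SVC paraphrasing: a support verb plus determiner
# plus predicate noun (with an optional trailing preposition) is replaced
# by the corresponding full verb. All linguistic resources here are
# illustrative stand-ins, not the NooJ module's actual data.
import re

NOMINALIZATIONS = {   # hypothetical predicate-noun -> verb mapping
    "cancellation": "cancel",
    "reservation": "reserve",
    "decision": "decide",
    "installation": "install",
}

def paraphrase_svc(sentence):
    pattern = re.compile(
        r"\b(?:make|makes|take|takes|have|has|give|gives)"
        r"\s+(?:a|an|the)\s+(\w+)(?:\s+of)?",
        re.IGNORECASE)
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        # leave the construction untouched if the noun is unknown
        return verb if verb else match.group(0)
    return pattern.sub(repl, sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

Unknown predicate nouns pass through unchanged, mirroring the behaviour described in Section 5 for out-of-vocabulary words.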
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM markedly increases fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), but we achieve high gains in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or syntactically complex structures. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences share the same meaning but not the same lexical items.
The hybridization strategy implemented here has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs and the support of other languages as well. Last but not least, applying a paraphrasing framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1, Centro Interuniversitário de Estudos Germanísticos. Coimbra, Portugal: Universidade de Coimbra.
Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.
Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, volume 9, 1-49. Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.
Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.
He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.
He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.
Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.
Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.
Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.
Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.
Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.
Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.
Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.
Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.
Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.
Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly to savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).
In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption to our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
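Such subsegment records follow the conventional Moses phrase-table layout and can be read as sketched below. The field layout is simplified (a real Moses phrase table may carry additional fields such as counts), and the parsing function is ours, written for illustration.

```python
# A sketch of reading a Moses-style phrase-table line: source phrase,
# target phrase, the four scores (inverse phrase probability, inverse
# lexical weighting, direct phrase probability, direct lexical
# weighting) and the word-alignment points. Simplified layout.

def parse_phrase_table_line(line):
    src, tgt, scores, alignment = [f.strip() for f in line.split("|||")]
    return {
        "source": src,
        "target": tgt,
        "scores": [float(s) for s in scores.split()],
        # "0-0 1-1" -> [(0, 0), (1, 1)]: source index - target index
        "alignment": [tuple(map(int, p.split("-"))) for p in alignment.split()],
    }

line = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
entry = parse_phrase_table_line(line)
print(entry["target"], entry["scores"], entry["alignment"])
# be greater [0.538, 0.053, 0.538, 0.136] [(0, 0), (1, 1)]
```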
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all the other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
2 In current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
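The overlap test at the heart of JOINO reduces to interval arithmetic over token positions. The sketch below only merges the covered spans of the input segment; producing the joined target side, the language-model fluency scoring and the list bookkeeping of the full algorithm are omitted, and the example spans are hypothetical.

```python
# An illustrative core of the JOINO idea: two subsegments found in the
# input segment, represented as (start, end) token spans with end
# exclusive, are merged when they overlap. Target-side assembly and
# fluency filtering are intentionally left out of this sketch.

def join_overlapping(span_a, span_b):
    """Return the merged span if the two spans overlap, else None."""
    (a0, a1), (b0, b1) = sorted([span_a, span_b])
    if b0 >= a1:      # disjoint or merely adjacent: JOINO does not apply
        return None
    return (a0, max(a1, b1))

segment = "lze rozdělit do následujících kategorií".split()
hit1 = (0, 3)   # hypothetical TM hit covering "lze rozdělit do"
hit2 = (2, 5)   # hypothetical TM hit covering "do následujících kategorií"
merged = join_overlapping(hit1, hit2)
print(merged, "->", " ".join(segment[merged[0]:merged[1]]))
# (0, 5) -> lze rozdělit do následujících kategorií
```

Merely adjacent spans (b0 == a1) fall to the JOINN variant, which would accept that case instead.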
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all the tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).
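The banding that such an analysis performs can be mimicked with a small script. The similarity function below is a crude shared-token stand-in for MemoQ's internal fuzzy matching (which is proprietary), and only the categorisation logic, with band boundaries taken from the tables, is meant to be faithful.

```python
# A rough sketch of a pre-translation coverage analysis: each document
# segment is assigned to a match band according to its best similarity
# against the TM sources. The token-overlap scoring is a stand-in, not
# MemoQ's algorithm; the bands follow the tables in this section.

BANDS = [(1.00, "100%"), (0.95, "95-99%"), (0.85, "85-94%"),
         (0.75, "75-84%"), (0.50, "50-74%"), (0.00, "No match")]

def similarity(a, b):
    a, b = a.split(), b.split()
    return len(set(a) & set(b)) / max(len(a), len(b))

def best_match(segment, tm_sources):
    return max((similarity(segment, s) for s in tm_sources), default=0.0)

def band(score):
    for threshold, name in BANDS:
        if score >= threshold:
            return name

tm = ["lze rozdělit do těchto kategorií"]
print(band(best_match("lze rozdělit do následujících kategorií", tm)))
# 75-84%
```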
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, compared with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.
Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References
Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.
Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.
Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.
Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.
Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.
Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.
Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.
Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.
Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.
Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3,
Mihaela Vela2, Josef van Genabith2,3
1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into
1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress
of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators visualize at a glance how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.2.5 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.2.5 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
(“we”,“we”,C,0) (“want”,“”,D,0) (“to”,“would”,S,0) (“have”,“like”,S,0) (“a”,“a”,C,0) (“table”,“table”,C,0) (“near”,“by”,S,0) (“the”,“the”,C,0) (“window”,“window”,C,0) (“.”,“.”,C,0)

we want  to   have a table near the window .
|  D     S    S    |   |    S    |    |    |
we  -  would like  a table  by  the window .
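A TER-style word alignment like the one above can be approximated with a standard edit-distance backtrace. The sketch below is illustrative only: it is not the tercom implementation, and it omits TER's shift operation, covering just the C/S/I/D labels used in this example.

```python
def edit_alignment(hyp, ref):
    """Word-level Levenshtein alignment between a TM match (hyp) and an
    input sentence (ref), labelling each step C (match), S (substitution),
    I (insertion) or D (deletion).  Approximates TER without shifts."""
    h, r = hyp.split(), ref.split()
    n, m = len(h), len(r)
    # cost[i][j] = minimum number of edits between h[:i] and r[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (h[i - 1] != r[j - 1]),
                             cost[i - 1][j] + 1,    # delete h[i-1]
                             cost[i][j - 1] + 1)    # insert r[j-1]
    # backtrace to recover the operation sequence
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (h[i - 1] != r[j - 1]):
            ops.append((h[i - 1], r[j - 1], 'C' if h[i - 1] == r[j - 1] else 'S'))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append((h[i - 1], '', 'D'))
            i -= 1
        else:
            ops.append(('', r[j - 1], 'I'))
            j -= 1
    return list(reversed(ops))
```

On the TM match and input sentence above, this recovers the same edit composition as the tercom output: six matches, three substitutions, and one deletion.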
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment (“does it cost”), as opposed to deleting translations for 6 non-matching words (“to the holiday inn by taxi”) in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
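The argument above can be made concrete with a small weighted scoring function over TER-style alignment operations. The weight values below are illustrative assumptions; the paper only states that deletion gets a very low weight and the other operations equal weights.

```python
# Illustrative re-scoring of TER alignments with a cheap deletion.
# The exact weights are assumptions, not the paper's published values.
WEIGHTS = {'C': 0.0, 'S': 1.0, 'I': 1.0, 'D': 0.1}

def postedit_cost(ops):
    """ops: list of (tm_word, input_word, op) triples from a TER-style
    alignment; a lower cost means a better post-editing candidate."""
    return sum(WEIGHTS[op] for _, _, op in ops)

# Test segment: "how much does it cost"
# TM segment 1 needs 6 deletions; TM segment 2 needs 3 insertions.
seg1_ops = [('how', 'how', 'C'), ('much', 'much', 'C'),
            ('does', 'does', 'C'), ('it', 'it', 'C'),
            ('cost', 'cost', 'C')] + \
           [(w, '', 'D') for w in 'to the holiday inn by taxi'.split()]
seg2_ops = [('how', 'how', 'C'), ('much', 'much', 'C')] + \
           [('', w, 'I') for w in 'does it cost'.split()]

assert postedit_cost(seg1_ops) < postedit_cost(seg2_ops)  # segment 1 wins
```

With plain TER, segment 2 (3 edits) would outrank segment 1 (6 edits); with the cheap deletion weight, segment 1 is preferred, as the paper argues.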
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
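Composing the two alignments can be sketched as follows. The function and data layout are hypothetical illustrations, not CATaLog's actual code: TER match information on the TM source words is projected through the GIZA++ source-target alignment onto the target words.

```python
def color_target(ter_ops, word_alignment, target_len):
    """Mark each target word green or red.
    ter_ops: (tm_source_word, input_word, op) triples from a TER-style
    alignment of the TM source sentence against the input sentence.
    word_alignment: dict mapping a TM source word position to the list
    of target word positions it is aligned to (GIZA++-style, 0-based).
    Illustrative composition of the two alignments."""
    colors = ['red'] * target_len
    src_pos = 0
    for tm_word, _, op in ter_ops:
        if tm_word == '':          # insertion: no TM source word consumed
            continue
        if op == 'C':              # source word matches the input sentence:
            for t in word_alignment.get(src_pos, []):
                colors[t] = 'green'   # keep its aligned target words
        src_pos += 1
    return colors
```

For the paper's example, the matched source words (we, a, table, the, window, and the final period) project green onto their aligned Bengali positions, while the targets of the deleted and substituted words stay red.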
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a screenshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
39
5 you 're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to turn these posting lists on or off; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will differ significantly.
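The posting-list filter can be sketched in a few lines. The stop-word list and function names here are illustrative assumptions, not the tool's actual implementation:

```python
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'an', 'is', 'to', 'of'}  # illustrative list

def build_posting_lists(tm_sources):
    """Inverted index: lowercased surface form -> set of TM segment ids.
    No stemming is applied, so only exact surface forms are indexed."""
    index = defaultdict(set)
    for seg_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(seg_id)
    return index

def candidate_segments(index, input_sentence):
    """TM segments sharing at least one vocabulary word with the input;
    only these candidates are then scored with TER."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= index.get(word, set())
    return candidates
```

The index is built offline once; at query time only the candidate set, rather than the whole TM, is passed to the (much more expensive) TER scoring step.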
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weights assigned to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015
Spotting false translation segments in translation memories
Eduard Barbu
Translated.net
eduard@translated.net
Abstract
The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.
1 Introduction
MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.
The quality of the translations obtained with the first method is high and the errors are relatively few.
1httpsmymemorytranslatednet
Figure 1: The distribution of automatically acquired memories in MyMemory
However, the second method, and especially the third one, produce a significant number of erroneous translations: the automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.
The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
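The Church-Gale scores mentioned in the abstract are length-based statistics. A minimal sketch follows, using the per-character mean and variance estimated by Gale and Church (1993) for European language pairs; whether the paper uses exactly these parameters is an assumption.

```python
import math

def church_gale_score(src, tgt, c=1.0, s2=6.8):
    """Length-based score in the spirit of Gale & Church (1993): the
    number of standard deviations by which the target length deviates
    from the length expected given the source.  c = expected target
    chars per source char, s2 = variance per source char; the defaults
    are Gale & Church's estimates for European language pairs (an
    assumption for other pairs).  Bi-segments with very large scores
    are unlikely to be translations, so they can be harvested as
    negative training examples, and low-scoring pairs as positives."""
    l1, l2 = len(src), len(tgt)
    if l1 == 0:
        return float('inf')
    return abs(l2 - c * l1) / math.sqrt(s2 * l1)
```

A well-matched pair such as "hello world" / "ciao mondo" scores near zero, while a pair whose lengths are wildly mismatched scores far higher, which is what makes the score usable for selecting training examples.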
The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw conclusions and plan further developments.
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.
There are two non-technical requirements of translation memory systems that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g., tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in that approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases when one or both segments are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French) and the aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples because the great majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993): in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.
CG = (ls − ld) / √(3.4 (ls + ld))    (1)
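Equation 1 is straightforward to implement; below is a minimal sketch, assuming that segment length is measured in characters (the paper does not state the length unit explicitly):

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011):
    CG = (ls - ld) / sqrt(3.4 * (ls + ld)),
    where ls and ld are the lengths of the source and
    destination segments."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A true translation pair yields a score near zero;
# a large imbalance in length pushes the score away from zero.
print(church_gale_score("How are you?", "Come stai?"))
```

For equal-length segments the score is exactly 0, which matches the intuition that true translations have roughly the same length.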
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set were produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different, and this is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church Gale Scores in the Matecat Set.
Figure 3: The distribution of Church Gale Scores in the Collaborative Set.
Figure 4: The normal curve added to the distribution of Church Gale Scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: it contains 1243 bi-segments, of which 373 are negative examples.
• Test Set: it contains 309 bi-segments, of which 87 are negative examples.
The proportion of negative examples in both sets is approximately 30%.
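The biased sampling from the tails of the Church-Gale distribution can be sketched as follows. The weighting scheme (the absolute score plus a small floor, so that central pairs can still be drawn) is our assumption; the paper does not specify how the bias was implemented:

```python
import math
import random

def church_gale_score(src: str, tgt: str) -> float:
    """Modified Church-Gale length difference (equation 1)."""
    return (len(src) - len(tgt)) / math.sqrt(3.4 * (len(src) + len(tgt)))

def sample_biased(bisegments, k, rng=random.Random(0)):
    """Draw k bi-segments, biasing the sampling towards pairs whose
    Church-Gale score lies far from the center of the distribution
    (likelier negatives). The |CG| + 0.1 weight is a hypothetical
    choice: a small floor keeps central pairs reachable."""
    weights = [abs(church_gale_score(s, t)) + 0.1 for s, t in bisegments]
    return rng.choices(bisegments, weights=weights, k=k)
```

With this weighting, a pair whose source is much longer than its target is drawn far more often than a balanced pair, which is the effect the authors describe.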
Figure 5: The Q-Q plot for the Matecat set.
Figure 6: The Q-Q plot for the Collaborative set.
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.
• has tag. The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.
• has email. The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.
• has number. The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segment contains words with at least one capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
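The examples above amount to counting mismatches between declared and detected language codes; a minimal sketch of the raw feature (before the normalization of all features to [0, 1] mentioned in the text):

```python
def lang_dif(declared_src: str, declared_tgt: str,
             detected_src: str, detected_tgt: str) -> int:
    """Number of mismatches between the declared and the detected
    language codes of a bi-segment: 0, 1 or 2."""
    return int(declared_src != detected_src) + int(declared_tgt != detected_tgt)

assert lang_dif("en", "it", "en", "it") == 0  # en-en / it-it
assert lang_dif("en", "it", "en", "fr") == 1  # en-en / it-fr
assert lang_dif("en", "it", "de", "fr") == 2  # en-de / it-fr
```

Dividing the raw count by 2 would give the normalized value; the paper does not spell out the normalization for this feature, so that detail is our assumption.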
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random Forests have a lower probability of overfitting the data than Decision Trees.
• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
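The comparison of these classifiers takes only a few lines of scikit-learn. The sketch below uses randomly generated stand-in data in place of the feature matrix of section 4.1, and the default hyperparameters are our assumption (the paper does not report its settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Toy stand-in data: 120 examples with 4 features in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((120, 4))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Three-fold evaluation; for classifiers scikit-learn stratifies
# the folds by default, as in the paper's first evaluation.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.2f}")
```

Swapping the toy `X` and `y` for the real feature matrix and labels reproduces the experimental setup of section 5.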
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for the Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets; obviously this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased even if the recall drops. We make some suggestions in the next section.
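The confusion counts above translate into the headline metrics directly; a quick check of the 10% figure (small differences from the rounded values in Table 2 may reflect averaging choices not visible in the paper):

```python
# SVM confusion counts on the 309-example test set, as reported above.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn  # 309

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# The 32 false negatives are good bi-segments that would be deleted:
print(f"discarded good bi-segments: {fn / total:.0%}")  # 10%
```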
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples is dissimilar: the distribution is closer to the normal distribution for the Matecat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
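This effect can be illustrated with the standard Levenshtein ratio that most TM tools approximate (the exact proprietary scoring of each tool differs); the sample segments are borrowed from an example used later in the paper:

```python
# Levenshtein-based fuzzy matching:
# similarity = 1 - edit_distance / length of the longer segment.
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b))

clause = "the ministry of defense once indicated"
full = ("the ministry of defense once indicated that about 20000 "
        "soldiers were missing in the korean war")
# Against the whole TM sentence the clause falls well below a
# typical 70% threshold...
print(f"{fuzzy_match(clause, full):.2f}")    # 0.40
# ...but against the pre-split clause it is an exact match.
print(f"{fuzzy_match(clause, clause):.2f}")  # 1.00
```

The clause scores only 0.40 against the full sentence but 1.00 against its pre-split counterpart, which is precisely why clause splitting raises the number of retrievable matches.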
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example
5 Results
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    23.33%                     90.00%
Retrieved (minimum threshold)    38.00%                     92.22%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    14.00%                     88.89%
Retrieved (minimum threshold)    14.00%                     96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4 Percentage of correctly retrieved
segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
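The paper reports t-tests computed with SPSS; the same statistic can be sketched in a few lines. The scores below are invented placeholders (the per-segment scores are not published), and multi-clause inputs are averaged into one paired observation, as described above.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic for two equal-length samples (df = n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical per-segment match scores (0 = correct match not retrieved).
without_split = [0.0, 0.0, 76.0, 0.0, 81.0, 0.0]
# With clause splitting, a multi-clause input yields one score per clause;
# averaging keeps exactly one observation per original segment.
with_split = [mean(s) for s in
              [[92.0], [88.0, 95.0], [100.0], [90.0], [97.0], [85.0]]]

t = paired_t(with_split, without_split)  # large positive t favours splitting
```

The resulting t value would then be compared against the t distribution with n-1 degrees of freedom to obtain the p-value (SPSS, or `scipy.stats.ttest_rel`, does this step directly).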
SET A

WORDFAST            Mean without clause splitting   Mean with clause splitting   p-value
Default threshold   19.23                           85.30888891                  0.000
Minimum threshold   29.28                           86.48888891                  0.000

TRADOS              Mean without clause splitting   Mean with clause splitting   p-value
Default threshold   11.78                           83.87777781                  0.000
Minimum threshold   11.78                           88.07111113                  0.000

SET B

WORDFAST            Mean without clause splitting   Mean with clause splitting   p-value
Default threshold   2.23                            17.21111111                  0.000
Minimum threshold   6.49                            30.62111112                  0.000

TRADOS              Mean without clause splitting   Mean with clause splitting   p-value
Default threshold   2.71                            23.083                       0.000
Minimum threshold   16.83                           46.36777779                  0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause
splitting as a pre-processing step in TM match
retrieval can significantly increase matching
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    2.67%                      17.84%
Retrieved (minimum threshold)    10.00%                     41.08%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    3.33%                      25.95%
Retrieved (minimum threshold)    36.00%                     70.67%

Table 4: Percentage of correctly retrieved segments in Set B
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is whether they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.
Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.
Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].
Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).
Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.
Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.
Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.
Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.
Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.
Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.
Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are similar in meaning but different in wording. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques have an effect only on improving the raw MT output, not on improving fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
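Koehn's (2010) fuzzy match formulation can be sketched at word and character level as follows. This is an approximation: tokenisation and normalisation details differ between tools, which is why these figures do not reproduce the paper's 70% and 76% exactly.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy(a, b, unit=str.split):
    """Fuzzy match score in [0, 1]: 1 - edit_distance / longer length."""
    s, t = unit(a), unit(b)
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
s4 = "Press Cancel to cancel your booking information"

word_score = fuzzy(s1, s2)   # 1 substitution + 3 deletions over 10 tokens
char_score = fuzzy(s1.replace(" ", ""), s2.replace(" ", ""), unit=list)
# Sentence (4) against (2) differs in a single word, so it scores much higher.
high_score = fuzzy(s4, s2)
```

Under this normalisation, the SVC sentence (1) scores low against its full-verb paraphrase (2), while the near-identical pair (2)/(4) scores high, which is precisely the gap the paraphrasing approach targets.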
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC with a predicate noun (a nominalization), while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear with a direct object without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
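For readers unfamiliar with the format, such an entry can be read as a lemma plus a category and a list of attribute flags. A minimal parser sketch (assuming the conventional NooJ comma between lemma and category, which PDF extraction tends to drop; the field names are our own):

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line into lemma, category and features."""
    lemma, _, rest = line.partition(",")
    category, *features = rest.split("+")
    return {"lemma": lemma, "category": category, "features": features}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
# -> lemma 'artist', category 'N', features ['FLX=TABLE', 'Hum']
```

In the real system, NooJ itself interprets these fields (inflectional paradigm FLX, semantic class Hum/Conc, etc.); the sketch only illustrates the shape of the data.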
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
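The rule itself is implemented as a NooJ local grammar over a full lexicon. As a rough illustration of the pattern it encodes (support verb + optional determiner + nominalization + optional preposition, rewritten as a full verb), here is a toy version: the word lists and the noun-to-verb mapping are hand-made illustrative stand-ins for NooJ's lexicon and derivational paradigms, and tense/agreement handling is omitted.

```python
# Illustrative stand-ins for the NooJ lexicon (not the paper's resources).
SUPPORT_VERBS = {"make", "makes", "take", "takes", "give", "gives", "have", "has"}
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve",
                   "decision": "decide"}  # predicate noun -> derived full verb
DETERMINERS = {"a", "an", "the"}
PREPOSITIONS = {"of", "to", "on"}

def paraphrase_svc(tokens):
    """Rewrite 'support verb (+ det) + nominalization (+ prep)' as a full verb."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in SUPPORT_VERBS:
            j = i + 1
            if j < len(tokens) and tokens[j].lower() in DETERMINERS:
                j += 1                          # skip optional determiner
            if j < len(tokens) and tokens[j].lower() in NOMINALIZATIONS:
                out.append(NOMINALIZATIONS[tokens[j].lower()])
                j += 1
                if j < len(tokens) and tokens[j].lower() in PREPOSITIONS:
                    j += 1                      # drop optional preposition
                i = j
                continue
        out.append(tokens[i])
        i += 1
    return out

sent = "Press Cancel to make the cancellation of your personal information"
result = " ".join(paraphrase_svc(sent.split()))
```

Applied to sentence (1), this toy rule produces sentence (2); the real grammar additionally conjugates the derived verb for every tense and mood, which this sketch does not attempt.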
The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units (if any) with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No match                  62       35
Total                    200      200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine translation errors will appear, and less post-editing is required for matches below 100%.
Despite the improvements in fuzzy matching, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.
The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.
Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, Volume 9, Papers from the GSAS/Dudley House workshop on light verbs, 1–49.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Xerox Palo Alto Research Center, Palo Alto, CA, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size, real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities                   alignment
být větší   be greater  0.538 0.053 0.538 0.136        0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148        0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "být" from the source language is translated to the first word in the translation, "be", and the second word "větší" to the second, "greater". These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
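As an illustration of the data described above (not the authors' code), a Moses-style phrase-table entry carrying the four probabilities and the alignment points can be parsed as follows; the `|||`-separated layout is the standard Moses phrase-table format, assumed here for concreteness:

```python
# Hypothetical sketch: parsing one Moses-style phrase-table entry of the
# shape shown in Figure 1 (source, target, four probabilities, alignment).
from dataclasses import dataclass

@dataclass
class Subsegment:
    source: str
    target: str
    probs: tuple      # inverse phrase, inverse lexical, direct phrase, direct lexical
    alignment: list   # list of (source_index, target_index) pairs

def parse_entry(line: str) -> Subsegment:
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| 0-0 1-1' (Moses phrase-table style)."""
    src, tgt, probs, align = [field.strip() for field in line.split("|||")]
    points = [tuple(int(i) for i in p.split("-")) for p in align.split()]
    return Subsegment(src, tgt, tuple(float(p) for p in probs.split()), points)

entry = parse_entry("být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
print(entry.probs[2])     # direct phrase translation probability -> 0.538
print(entry.alignment)    # -> [(0, 0), (1, 1)]
```

The direct phrase translation probability (the third score) is what a selection step would typically rank by when a subsegment has several alternative translations.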
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is denoted J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is denoted S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output = OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output = NS.

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can further be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
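The span-joining idea behind JOIN can be sketched as follows. This is a simplified, hypothetical illustration, not the paper's implementation: subsegments are reduced to (start, end) token spans in the input segment, the translation assembly and language-model scoring are omitted, and the prepend-and-restart bookkeeping is collapsed into a plain greedy loop that still prefers longer spans first:

```python
# Hypothetical sketch of JOIN: subsegments are (start, end) token spans in
# the input segment; two spans can be joined when they overlap (JOINO) or
# neighbour each other (JOINN). Translation combination is omitted.
def try_join(a, b):
    """Return the joined span, or None if a and b neither overlap nor neighbour."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 <= e1 + 1:               # overlap (s2 <= e1) or adjacency (s2 == e1 + 1)
        return (s1, max(e1, e2))
    return None

def greedy_join(spans):
    """Greedily join spans, longest first, mirroring the preference for long subsegments."""
    pool = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while pool:
        current = pool.pop(0)
        for other in list(pool):
            joined = try_join(current, other)
            if joined:
                pool.remove(other)
                current = joined
        result.append(current)
    return result

print(greedy_join([(0, 2), (3, 5), (7, 8)]))   # -> [(0, 5), (7, 8)]
```

Here the spans (0, 2) and (3, 5) are adjacent and merge into (0, 5), while (7, 8) is left uncovered; in the full method the gap would be a candidate for SUBSTITUTE.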
² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken over from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable, and a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.
source segment:     Oblast dat může mít libovolný tvar
reference:          The data area may have an arbitrary shape
generated segment:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.
Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post-Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³
¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,² which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TMs, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations.⁴ Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

³ The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴ Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
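The color-coding idea can be sketched as follows. This is an illustrative sketch only, not CATaLog's actual implementation: given a per-token boolean match mask (which in the tool would come from the TER alignment), matched tokens are rendered in green and unmatched tokens in red using ANSI escape codes:

```python
# Illustrative sketch (not CATaLog's code): render a TM suggestion with
# matched tokens in green and unmatched tokens in red, given a match mask.
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def color_code(tokens, matched):
    """tokens: list of words; matched: parallel list of booleans."""
    return " ".join(
        (GREEN if m else RED) + tok + RESET
        for tok, m in zip(tokens, matched)
    )

tm_match = "we want to have a table near the window".split()
# True where the TER alignment marks a match with the input segment
mask = [True, False, False, False, True, True, False, True, True]
print(color_code(tm_match, mask))
```

In a terminal this prints "we ... a table ... the window" in green and "want to have ... near" in red, giving the at-a-glance view of post-editing effort described above.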
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1⁵) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
⁵ http://www.cs.umd.edu/~snover/tercom
Input we would like a table by the window
TM Match we want to have a table near thewindow
TER alignmentldquowerdquoldquowerdquoC0ldquowantrdquoldquordquoD0ldquotordquoldquowouldrdquoS0ldquohaverdquoldquolikerdquoS0ldquoardquoldquoardquoC0ldquotablerdquoldquotablerdquoC0ldquonearrdquoldquobyrdquoS0ldquotherdquoldquotherdquoC0ldquowindowrdquoldquowindowrdquoC0ldquordquoldquordquoC0
we  want  to    have  a  table  near  the  window  .
|   D     S     S     |  |      S     |   |        |
we  -     would like  a  table  by    the  window  .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results; these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
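The effect of a cheap deletion can be reproduced with a simple weighted edit-distance scorer. The sketch below is our own illustration, not the tool's implementation: the actual system derives its score from the tercom alignments, the concrete weight values are not given in the paper, and the shift operation is omitted here for brevity.

```python
def weighted_edit_cost(src, tgt, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Cost of transforming token list src into tgt with per-operation weights."""
    n, m = len(src), len(tgt)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_del            # delete all remaining src tokens
    for j in range(1, m + 1):
        d[0][j] = j * w_ins            # insert all remaining tgt tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if src[i - 1] == tgt[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,      # deletion
                          d[i][j - 1] + w_ins,      # insertion
                          d[i - 1][j - 1] + sub)    # match / substitution
    return d[n][m]

test_seg = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

equal_1 = weighted_edit_cost(tm1, test_seg)   # 6 deletions -> 6.0
equal_2 = weighted_edit_cost(tm2, test_seg)   # 3 insertions -> 3.0
cheap_del_1 = weighted_edit_cost(tm1, test_seg, w_del=0.1)   # ~0.6
cheap_del_2 = weighted_edit_cost(tm2, test_seg, w_del=0.1)   # 3.0
```

With equal weights TM segment 2 wins (3.0 vs. 6.0); lowering the deletion weight to 0.1 flips the ranking in favor of TM segment 1, as argued above.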
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between the selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
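The combination of the two alignment sets can be sketched as follows. The data structures and 0-based TM-source positions are our own illustration, not the tool's internal format; the word alignment is the GIZA++ example above (Bengali target positions as given in the paper).

```python
def color_positions(ter_ops, word_align):
    """ter_ops: list of (op, tm_source_position); op "C" marks a TER match.
    word_align: TM-source position -> set of aligned TM-target positions.
    Returns the source and target positions to paint green; everything
    else is painted red."""
    matched_src = {pos for op, pos in ter_ops if op == "C"}
    matched_tgt = {t for s in matched_src for t in word_align.get(s, set())}
    return matched_src, matched_tgt

# TER operations for TM source "we want to have a table near the window"
# against the input "we would like a table by the window" (0-based positions).
ter_ops = [("C", 0), ("D", 1), ("S", 2), ("S", 3), ("C", 4),
           ("C", 5), ("S", 6), ("C", 7), ("C", 8)]
# GIZA++-style alignment of that TM source to its Bengali translation.
word_align = {0: {1}, 1: {6}, 4: {4}, 5: {5}, 6: {3}, 8: {2}}

green_src, green_tgt = color_positions(ter_ops, word_align)
```

Matched source tokens ("we", "a", "table", "the", "window") project through the word alignment onto the target positions to be shown in green; unaligned matched tokens (here "the") simply contribute no target position.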
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
5 you 're overcharging me
Target Matches
1 আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপনি ভুল (Gloss: apni vul)

4 আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)

5 আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the color-coded 5 topmost TM matches listed above, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
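The posting-list filtering described above can be sketched as follows; the stop-word list and the toy TM data are our own illustration, not the tool's actual configuration.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "is"}  # illustrative list

def build_posting_lists(tm_sources):
    """Map each lowercased, non-stop-word surface form to the set of
    TM sentence ids containing it. No stemming is applied, so only
    exact surface-form matches are retrieved."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidate_sentences(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the (expensive) TER computation."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "you are wrong",
      "how much does it cost"]
index = build_posting_lists(tm)
cands = candidate_sentences(index, "you gave me wrong number")
```

For the query above only the first two TM sentences are retained as candidates; the third shares no vocabulary word with the input and is skipped, which is exactly how the search space shrinks.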
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on, and the development of, a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
2 Related Work
Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization, from a technical point of view, is the speed of retrieval.
There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench (see footnote 2). Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g., tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.
A task somewhat similar to translation memory cleaning, as envisioned in section 1, is quality estimation in machine translation. Quality estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by machine translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.
2 http://www.xbench.net
3 The data
In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memory part of MyMemory contains. They can be roughly classified into four types:
1. Random Text. Random Text errors are cases where one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.
2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates into Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").
3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases where a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French); the aligner does not check the language code of the aligned segments.
4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").
The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.
It is relatively easy to find positive examples, because the great majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length (see footnote 3). To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:
CG = (ls − ld) / sqrt(3.4 (ls + ld))        (1)
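Equation 1 can be computed directly; the sketch below uses character lengths, which is an assumption on our part, as the unit of length is not stated explicitly in the paper.

```python
import math

def church_gale(source, target):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    (ls - ld) / sqrt(3.4 * (ls + ld))."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A length-balanced bi-segment scores near 0; a partial translation
# drifts away from 0 (illustrative segment pairs).
score_ok = church_gale("we would like a table", "vorremmo un tavolo")
score_partial = church_gale("Early 1980s Muirfield CC", "Primi anni 1980")
```

The sign encodes which side is longer, and the square-root denominator dampens the raw length difference for long segments, so a fixed threshold remains meaningful across segment sizes.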
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat (see footnote 4). The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center; therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
Figure 2: The distribution of Church-Gale scores in the Matecat Set. (Histogram; x axis: Church-Gale score, y axis: probability.)
Figure 3: The distribution of Church-Gale scores in the Collaborative Set. (Histogram; x axis: Church-Gale score, y axis: probability.)
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distributions of Church-Gale scores are to a normal distribution, we have plotted them against the normal distribution using Quantile-Quantile plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores lie away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:
• Training Set: it contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: it contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat Set. (x axis: theoretical quantiles, y axis: sample quantiles.)
Figure 6: The Q-Q plot for the Collaborative Set. (x axis: theoretical quantiles, y axis: sample quantiles.)
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist and, if it exists, is present/is not present in the source and target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
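As an illustration, the punctuation similarity feature could be computed as below. This is our own sketch; the paper does not give the exact vector construction, so treating each punctuation mark's count as a vector component is an assumption.

```python
import math
import string
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def punctuation_similarity(source, target):
    """Cosine similarity of the punctuation-mark count vectors of
    the two segments; 0.0 when either segment has no punctuation."""
    vec = lambda s: Counter(ch for ch in s if ch in string.punctuation)
    return cosine(vec(source), vec(target))
```

The same cosine helper would serve the tag, email, URL and number similarity features, with the vectors built over the respective token types instead of punctuation marks.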
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu (see footnote 5).
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers consisting of multiple decision trees; the final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
• Logistic Regression. Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support vector machines are one of the most widely used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problems where the decision boundary is irregular.
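The comparison of these classifiers can be sketched with scikit-learn. The feature matrix below is a synthetic stand-in (the real MyMemory feature data is not distributed with the paper), generated with the same number of instances and roughly the same 30% negative-class proportion as the training set; the number of features is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bi-segment feature matrix: 1243 instances,
# 16 features, ~30% negative class, mirroring the training set.
X, y = make_classification(n_samples=1243, n_features=16, weights=[0.3],
                           random_state=0)

# Three-fold stratified evaluation, as in the paper's first experiment.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
classifiers = [
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(random_state=0)),
    ("SVM (linear kernel)", SVC(kernel="linear")),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Gaussian Naive Bayes", GaussianNB()),
    ("K-Nearest Neighbors", KNeighborsClassifier()),
]
for name, clf in classifiers:
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.2f}")
```

With the real features in place of the synthetic matrix, this loop reproduces the evaluation protocol of section 5 (scores on synthetic data will of course differ from tables 1 and 2).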
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for the Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
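The evaluation protocol above (three-fold stratified cross-validation of the six classifiers against uniform and stratified dummy baselines) can be sketched with scikit-learn, the library the authors cite. The bi-segment features used in the paper are not reproduced here; a synthetic dataset stands in for them, so the printed scores are only illustrative.

```python
# Sketch of the three-fold stratified evaluation with two dummy baselines.
# The feature matrix is synthetic; the paper's actual features are not shown.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM (linear kernel)": LinearSVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:22s} F1 = {scores.mean():.2f}")
```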
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.
The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest             0.85     0.63  0.72
Decision Tree             0.82     0.69  0.75
SVM                       0.82     0.81  0.81
K-Nearest Neighbors       0.83     0.66  0.74
Logistic Regression       0.80     0.80  0.80
Gaussian Naive Bayes      0.76     0.61  0.68
Baseline Uniform          0.71     0.72  0.71
Baseline Stratified       0.70     0.51  0.59

Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased even if the recall drops. We make some suggestions in the next section.
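The figures discussed above follow directly from the reported confusion counts. As a quick check, the positive-class precision and recall can be recomputed from those counts (note that per-class figures computed this way can differ slightly from the table, which may report averaged scores):

```python
# Metrics recomputed from the SVM confusion counts reported in the text:
# 175 TP, 42 FP, 32 FN, 60 TN on the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)   # positive-class precision
recall = tp / (tp + fn)      # positive-class recall
f1 = 2 * precision * recall / (precision + recall)

# Share of ALL examples that are false negatives, i.e. good bi-segments
# that would be wrongly discarded ("around 10%" in the text).
false_negative_rate = fn / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"false negatives: {false_negative_rate:.1%} of all examples")
```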
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
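The second solution above (restricting automatic decisions to a high-confidence subset) can be sketched as a simple thresholding of the Logistic Regression probabilities. The data and the 0.9/0.1 cut-offs here are illustrative assumptions, not the paper's settings:

```python
# Sketch of high-precision filtering: act only on bi-segments whose predicted
# probability is confidently high or low; leave the middle band for review.
# Synthetic data stands in for the paper's bi-segment features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=1)
clf = LogisticRegression().fit(X[:400], y[:400])

proba = clf.predict_proba(X[400:])[:, 1]   # P(bi-segment is a good pair)
confident_good = proba >= 0.9              # keep automatically
confident_bad = proba <= 0.1               # discard automatically
undecided = ~(confident_good | confident_bad)

print(f"auto-keep: {confident_good.sum()}, auto-discard: {confident_bad.sum()}, "
      f"to review: {undecided.sum()}")
```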
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments: for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.
B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.
Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.
Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.
Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).
Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.
Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.
Marco Trombetti. 2009. Creating the world's largest translation memory.
Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
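The effect can be sketched with a toy similarity computation. Here difflib's ratio stands in for a TM fuzzy-match score (real tools use Levenshtein-based scores, but the behaviour illustrated is the same): the whole-sentence comparison falls well below a typical threshold, while the shared clause, stored as its own segment, matches exactly.

```python
# Toy illustration of the (a)/(b) example above: whole-sentence matching
# misses, clause-level matching hits. difflib's ratio approximates a
# Levenshtein-style fuzzy score in [0, 1].
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# (a) the new input sentence, twenty words long
input_sentence = " ".join(f"w{i}" for i in range(1, 21))
# (b) the TM sentence, sharing only the clause "w1 ... w5"
tm_sentence = " ".join(["w1", "w2", "w3", "w4", "w5"]
                       + [f"w{i}" for i in range(21, 36)])

# Whole sentences: similarity stays below a typical 70% fuzzy threshold
print(f"sentence vs sentence: {match_score(input_sentence, tm_sentence):.2f}")

# With clause splitting, the shared clause is a TM segment of its own
input_clause = "w1 w2 w3 w4 w5"
tm_clause = "w1 w2 w3 w4 w5"
print(f"clause vs clause:     {match_score(input_clause, tm_clause):.2f}")
```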
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM, the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example
5 Results
WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33                 90.00
Retrieved (minimum threshold)          38.00                 92.22

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00                 88.89
Retrieved (minimum threshold)          14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold           19.23                     85.31             0.000
Minimum threshold           29.28                     86.49             0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold           11.78                     83.88             0.000
Minimum threshold           11.78                     88.07             0.000

SET B

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold            2.23                     17.21             0.000
Minimum threshold            6.49                     30.62             0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold            2.71                     23.08             0.000
Minimum threshold           16.83                     46.37             0.000

Table 5: Paired t-test (mean match scores) on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67                 17.84
Retrieved (minimum threshold)          10.00                 41.08

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33                 25.95
Retrieved (minimum threshold)          36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work
There has been some work to improve the transla-
tion memory matching and retrieval of translation
units when working with CAT tools (Koehn and
Senellart 2010 He at al 2010a Zhechev and van
Genabith 2010 Wang et al 2013) Such works
aim to improve the machine translation (MT) con-
fidence measures to better predict the human ef-
fort in order to obtain a quality estimation that has
the potential to replace the fuzzy match score in
the TM In addition to this these techniques have
an effect only in improvement of the MT raw out-
put and not in improvement of fuzzy matching
A common methodology that gives priority to
the human translations is to search first for
matches in the project TM When no such close
match is found in the TM the sentence is ma-
chine-translated (He at al 2010a 2010b) In a
somewhat similar spirit other hybrid methodolo-
gies combine techniques at a sub-sentential level
Most of them use as much as possible human
translations for a given sentence and the un-
matched lexical items are machine translated in
the target language using a MT system (Smith and
Clark 2009 Koehn and Senellart 2010 He at al
2010a Zhechev and van Genabith 2010 Wang et
al 2013) Towards the improving of the quality
of the MT output researchers have been using dif-
ferent MT approaches (statistical rule-based or
example-based) trained either on generic or in-do-
main corpora Another innovative idea has been
proposed by Dong et al (2014) In their work
they use a lattice representation of possible trans-
lations in a monolingual target language corpus to
find the potential candidate translations
On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy match scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on paraphrases in a database during matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and a segment can contain several phrases that can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match is identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions needed to make two sentences identical (Hirschberg, 1997).
For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
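To make the computation concrete, here is a minimal word-level sketch of such a score (our own illustration: whitespace tokenization and normalization by the length of the longer segment; commercial tools use their own, undisclosed variants):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            cur = min(d[j] + 1,                       # deletion
                      d[j - 1] + 1,                   # insertion
                      prev + (a[i - 1] != b[j - 1]))  # substitution
            prev, d[j] = d[j], cur
    return d[len(b)]

def fuzzy_match(s1, s2):
    """Fuzzy match score: 1 - edits / length of the longer segment."""
    t1, t2 = s1.split(), s2.split()
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2))
```

For sentences (1) and (2), the word edit distance is 4 (one substitution, three deletions); the resulting percentage depends on the exact word count and normalization used, which is why scores reported by different tools can differ slightly.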
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear either without a determiner (e.g., to make attention) or with one (e.g., to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy match between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator would receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part of speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
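Such entries are easy to consume programmatically. The following parser is our own illustrative sketch (not part of NooJ), assuming the comma-and-plus layout shown in Figure 1:

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line, e.g. 'artist,N+FLX=TABLE+Hum',
    into (lemma, category, {paradigms/properties with values}, [flag features])."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category = parts[0]          # part of speech, e.g. N, V, A
    valued, flags = {}, []
    for p in parts[1:]:
        if "=" in p:             # e.g. FLX=TABLE, IT="translation"
            key, value = p.split("=", 1)
            valued[key] = value
        else:                    # e.g. Hum, Conc, Politics
            flags.append(p)
    return lemma, category, valued, flags
```

For example, parse_entry("artist,N+FLX=TABLE+Hum") yields the lemma artist, category N, the FLX inflectional paradigm TABLE and the semantic feature Hum.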
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived full verb and displayed as such, while the other elements of the construction are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
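The effect of the grammar can be approximated outside NooJ with a simple pattern substitution. The sketch below is our own simplification: it handles only base-form support verbs, uses a toy nominalization table, and ignores the tense and person agreement that the actual NooJ grammars enforce:

```python
import re

# Toy predicate-noun -> full-verb table (illustrative only; NooJ derives
# these mappings from its derivational paradigms).
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve",
                   "decision": "decide"}
SUPPORT_VERBS = r"(?:make|makes|made|take|takes|took|give|gives|gave|have|has|had)"

PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS + r"\b\s+(?:(?:a|an|the)\s+)?(\w+)(?:\s+of\b)?",
    re.IGNORECASE)

def paraphrase_svc(sentence):
    """Replace 'support verb (+ det) + nominalization (+ of)' with the full verb."""
    def repl(match):
        noun = match.group(1).lower()
        return NOMINALIZATIONS.get(noun, match.group(0))
    return PATTERN.sub(repl, sentence)
```

Applied to sentence (1), this rewrites "make the cancellation of" into "cancel", yielding sentence (2).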
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc., as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
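The overall expansion can be summarized in a few lines (a schematic sketch: `paraphrase` stands in for the whole NooJ pipeline, and real translation units carry metadata beyond the bare string pairs shown here):

```python
def expand_tm(units, paraphrase):
    """Append a paraphrased copy of each source unit that can be
    rewritten; the target side is kept 'as is'."""
    expanded = list(units)
    for src, tgt in units:
        new_src = paraphrase(src)
        if new_src != src:
            expanded.append((new_src, tgt))  # same human translation
    return expanded
```

Because only new source variants are added, every retrieved hypothesis remains a human translation from the original TM.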
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and, finally, 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought.
Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens either because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future work.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are much greater compared to MT approaches.
In the future, we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biccedilici E and Dymetman M 2008 Dynamic Transla-
tion Memory Using Statistical Machine Transla-
tion to improve Translation Memory Fuzzy
Matches In Proceedings of the 9th International
Conference on Intelligent Text Processing and
Computational Linguistics (CICLing 2008) LNCS
Haifa Israel February 2008
But M 2003 The Light Verb Jungle In Harvard Work-
ing Papers in Linguistics ed G Aygen C Bowern
and C Quinn 1ndash49 Volume 9 Papers from the
GSASDudley House workshop on light verbs
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.
Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TM). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithmsfor searching matching and suggesting segments
CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word align-ment between subsegment and its translation ie0-0 1-1 means that the first word byacutet from sourcelanguage is translated to the first word in the trans-lation be and the second word vetšiacute to the secondgreater These points give us important informa-tion about the translation 1) empty alignment 2)one-to-many alignment and 3) opposite orienta-tion
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOIN-O: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN-N: joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
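As an illustration, such a combined n-gram score can be computed from plain corpus counts; this toy stand-in for the KenLM model (with add-one smoothing of our own choosing) ranks fluent word orders above shuffled ones:

```python
from collections import Counter
from math import log

def ngram_fluency(tokens, corpus_tokens, max_n=5):
    """Average add-one-smoothed log-probability of all n-grams
    (n = 1..max_n) of `tokens` under counts from `corpus_tokens`."""
    score, terms = 0.0, 0
    for n in range(1, max_n + 1):
        counts = Counter(tuple(corpus_tokens[i:i + n])
                         for i in range(len(corpus_tokens) - n + 1))
        total = sum(counts.values()) or 1
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            score += log((counts[gram] + 1) / (total + len(counts)))
            terms += 1
    return score / max(terms, 1)
```

A candidate combination whose score falls below that of competing combinations (or below a threshold) would be discarded.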
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration, it generates new (longer) subsegments and discards one processed subsegment.
² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
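The joining step itself can be sketched over (start, end) token spans; this is our own simplified reading of the algorithm, covering the overlap (JOIN-O) and adjacency (JOIN-N) cases:

```python
def try_join(a, b):
    """Merge two subsegment spans if they overlap or are adjacent."""
    (s1, e1), (s2, e2) = a, b
    if s2 <= e1 + 1 and s1 <= e2 + 1:   # overlap or neighbouring spans
        return (min(s1, s2), max(e1, e2))
    return None

def join_subsegments(spans):
    """Greedy JOIN: repeatedly extend the longest span with any joinable
    span, preferring longer subsegments as the paper describes."""
    pending = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while pending:
        current = pending.pop(0)
        merged = True
        while merged:
            merged = False
            for other in list(pending):
                joined = try_join(current, other)
                if joined is not None:
                    pending.remove(other)
                    current = joined
                    merged = True
        result.append(current)
    return result
```

The real implementation additionally checks the fluency of each joined translation before accepting it.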
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have run and evaluated the methods also on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods (see Section 4). The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95-99%   0.84    0.91    0.64    0.90    1.37
85-94%   0.07    0.05    0.25    0.76    0.81
75-84%   0.80    0.91    1.71    3.78    4.40
50-74%   8.16   10.05   25.09   40.95   42.58
any     10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95-99%   0.75    0.44    0.49    0.66
85-94%   0.05    0.08    0.49    0.61
75-84%   0.46    0.96    3.67    3.85
50-74%  10.24   27.77   41.90   44.47
all     11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone, and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95-99%   0.98    0.59    1.24    1.45
85-94%   0.09    0.22    1.34    1.37
75-84%   1.03    2.26    6.35    7.07
50-74%  12.15   34.84   49.82   51.62
all     14.40   38.04   59.04   61.51
We have also compared the results with the output of the function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches

Company in-house translation memory
feature   SUB   OJ    NJ    NS
prec      0.60  0.63  0.70  0.66
recall    0.67  0.74  0.74  0.71
F1        0.64  0.68  0.72  0.68
METEOR    0.31  0.37  0.38  0.38

DGT
prec      0.76  0.93  0.91  0.81
recall    0.78  0.86  0.88  0.85
F1        0.77  0.89  0.89  0.83
METEOR    0.40  0.50  0.51  0.45
Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
count just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog2, which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source–target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT–TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7251 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviour on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7251 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window
TER alignment:
("we", "we", C, 0), ("want", "", D, 0), ("to", "would", S, 0),
("have", "like", S, 0), ("a", "a", C, 0), ("table", "table", C, 0),
("near", "by", S, 0), ("the", "the", C, 0), ("window", "window", C, 0),
(".", ".", C, 0)
TM match: we  want  to     have  a  table  near  the  window
edits:    |   D     S      S     |  |      S     |    |
input:    we  -     would  like  a  table  by    the  window
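Given such an alignment, the matched and to-be-edited portions of the TM match can be read off mechanically. A minimal sketch in Python (function and variable names are ours, not the tool's):

```python
def split_by_ops(tm_tokens, ops):
    """Partition TM-match tokens by TER edit operation.

    ops contains one symbol per alignment position: 'C' (match),
    'S' (substitute), 'D' (delete), 'I' (insert).  An 'I' position
    refers to an input-only token, so it consumes no TM token.
    """
    matched, unmatched = [], []
    i = 0
    for op in ops:
        if op == "I":  # insertion: nothing to mark in the TM match
            continue
        (matched if op == "C" else unmatched).append(tm_tokens[i])
        i += 1
    return matched, unmatched
```

For the example above, `split_by_ops("we want to have a table near the window".split(), "CDSSCCSCC")` separates the matched tokens (we, a, table, the, window) from the unmatched ones (want, to, have, near).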
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism, based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
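The weighted edit distance described above can be sketched as a standard dynamic program. The concrete weights below are illustrative; the paper does not report its exact values:

```python
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Token-level edit cost of turning the TM segment into the test segment.

    Deleting a TM token is deliberately cheap (w_del), mirroring the
    observation that post-editors delete much faster than they insert.
    """
    n, m = len(test), len(tm)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_ins          # test words absent from the TM: insert
    for j in range(1, m + 1):
        d[0][j] = j * w_del          # extra TM words: delete
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,
                          d[i][j - 1] + w_del,
                          d[i - 1][j - 1] + sub)
    return d[n][m]
```

With these weights, TM segment 1 above costs 6 cheap deletions (0.6) while TM segment 2 costs 3 insertions (3.0), so segment 1 is preferred, as the argument requires.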
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions imply matched fragments, and red portions imply a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
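Assuming the usual GIZA++ `word ({ i j })` line format shown above, the alignment can be read with a short parser. This is an illustrative sketch, not the tool's actual loader:

```python
import re

def parse_giza_line(line):
    """Return (source_word, [1-based target positions]) pairs,
    including the NULL token that collects unaligned target words."""
    pairs = re.findall(r"(\S+)\s+\(\{([^}]*)\}\)", line)
    return [(word, [int(i) for i in idxs.split()]) for word, idxs in pairs]
```

For the example line, `parse_giza_line` yields `("we", [1])`, `("want", [6])`, `("to", [])`, and so on, which is all the color-coding step needs.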
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). For improving the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output of both should be the same, though the time taken to produce the output will be significantly different.
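The posting-list filter can be sketched as follows; the stop-word list and all names are illustrative, not taken from the tool:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is"}  # illustrative

def build_posting_lists(tm_sources):
    """Map each lowercased surface-form word to the ids of TM segments containing it."""
    index = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:  # frequent, uninformative tokens are skipped
                index[word].add(sid)
    return index

def candidate_segments(index, sentence):
    """Only segments sharing at least one vocabulary word are later scored with TER."""
    ids = set()
    for word in sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```

The index is built offline once; at query time only the (usually small) candidate set is passed to the expensive TER alignment step.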
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. Ph.D. thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. Ph.D. thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (1993): in a parallel corpus, the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.
CG = (ls − ld) / √(3.4 (ls + ld))    (1)

where ls and ld are the lengths of the source and destination segments.
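Equation 1 translates directly into code. A minimal sketch (the function name is ours; we assume, as in Gale and Church's original work, that segment lengths are measured in characters):

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length difference (equation 1) between a
    source segment of length ls and a destination segment of length ld."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Equal lengths give a score of 0; a large length mismatch gives a
# score far from 0, in either direction depending on which side is longer.
balanced = church_gale(100, 100)   # 0.0
skewed = church_gale(200, 50)      # large positive score
```

The score is antisymmetric in its arguments, so a target segment much longer than its source produces an equally extreme negative score.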
In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.
If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].
The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the
3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com
[Figure 2: The distribution of Church-Gale scores in the Matecat Set.]
[Figure 3: The distribution of Church-Gale scores in the Collaborative Set.]
[Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.]
Church-Gale score varies in the interval [−131.51, 60.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using quantile-quantile (Q-Q) plots in figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
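The biased draw from the Collaborative Set could be sketched as below. This is an illustrative reconstruction, not the authors' code: the paper does not specify the exact sampling scheme, so we simply weight each bi-segment by the absolute value of its Church-Gale score.

```python
import math
import random

def church_gale(ls, ld):
    # Modified Church-Gale length difference (equation 1).
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

def biased_sample(bisegments, k, seed=0):
    """Draw k bi-segments, favouring those whose Church-Gale score
    lies far from the center of the distribution (likely errors)."""
    rng = random.Random(seed)
    # Small epsilon keeps perfectly balanced pairs drawable.
    weights = [abs(church_gale(len(s), len(t))) + 1e-6 for s, t in bisegments]
    return rng.choices(bisegments, weights=weights, k=k)

pairs = [
    ("hello world", "ciao mondo"),                      # balanced lengths
    ("ok", "this is a very long target segment"),       # suspicious mismatch
]
sample = biased_sample(pairs, k=10)
```

With these weights the length-mismatched pair is drawn far more often, which is the intended bias towards probable negatives.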
[Figure 5: The Q-Q plot for the Matecat Set.]
[Figure 6: The Q-Q plot for the Collaborative Set.]
4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test set are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the source and destination segments' punctuation vectors. The intuition behind this feature is that source and target segments should have similar punctuation vectors if the source segment and the target segment are true translations.

• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the source segment's translation into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
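A few of the features above can be sketched in Python. The helper names and the restriction to a handful of features are our own illustration, not the authors' implementation:

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors (Counters).
    common = set(c1) & set(c2)
    num = sum(c1[x] * c2[x] for x in common)
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def features(src, tgt):
    # Punctuation vector: counts of punctuation marks in a segment.
    punct = lambda s: Counter(ch for ch in s if ch in ".,;:!?")
    return {
        "same": int(src == tgt),
        "cg_score": (len(src) - len(tgt)) / math.sqrt(3.4 * (len(src) + len(tgt))),
        "has_url": int(bool(URL_RE.search(src) or URL_RE.search(tgt))),
        "has_number": int(bool(re.search(r"\d", src) or re.search(r"\d", tgt))),
        "punctuation_similarity": cosine(punct(src), punct(tgt)),
    }

f = features("Hello, world!", "Ciao, mondo!")
```

Here the two segments differ textually but share the same punctuation vector, so punctuation similarity is 1 while the binary features are 0.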
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation is detected. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules inferred are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than the Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
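Comparing these scikit-learn classifiers can be sketched as follows. The toy feature vectors and parameter choices below are ours, not the paper's experimental setup; they only illustrate fitting the six classifier types the paper lists:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors: [cg_score, bisegment_similarity, lang_dif]
X = [[0.1, 0.9, 0], [0.0, 0.8, 0], [0.2, 0.7, 0],
     [2.5, 0.1, 2], [3.0, 0.2, 1], [2.8, 0.0, 2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = correct bi-segment, 0 = false bi-segment

classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
    "LogisticRegression": LogisticRegression(),
    "SVM-linear": SVC(kernel="linear"),
    "GaussianNB": GaussianNB(),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
}
# Classify one unseen bi-segment that looks like a correct translation.
predictions = {name: clf.fit(X, y).predict([[0.1, 0.85, 0]])[0]
               for name, clf in classifiers.items()}
```

On this cleanly separable toy data, all six classifiers agree that the query bi-segment is a correct translation.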
5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform, and it generates predictions randomly. The second baseline is called Baseline Stratified, and it generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification
Except for the Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.
The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion matrix for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
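The trade-off can be checked directly from the reported confusion counts (175 true positives, 42 false positives, 32 false negatives, 60 true negatives):

```python
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Fraction of the whole test set that would be wrongly discarded:
# the false negatives as a share of all 309 examples (roughly 10%).
false_negative_share = fn / (tp + fp + fn + tn)
```

The derived precision, recall and F1 from the raw counts land close to the values reported in table 2, and the false-negative share confirms the "around 10%" figure quoted in the text.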
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked, and that higher scores correlate with the confidence that a bi-segment is positive or negative.
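Restricting cleaning to high-confidence bi-segments could be sketched as a threshold search over ranked probability scores. The data and helper names below are our own toy illustration of the idea, not the authors' procedure:

```python
def precision_at_threshold(scored, threshold):
    """scored: list of (probability_of_positive, true_label) pairs."""
    kept = [label for prob, label in scored if prob >= threshold]
    return sum(kept) / len(kept) if kept else 0.0

def smallest_threshold_with_precision(scored, target=0.9):
    # Scan candidate thresholds taken from the scores themselves and
    # return the smallest one whose precision on the kept set reaches
    # the target; None if the target is never reached.
    for t in sorted({prob for prob, _ in scored}):
        if precision_at_threshold(scored, t) >= target:
            return t
    return None

scored = [(0.95, 1), (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0)]
t = smallest_threshold_with_precision(scored, target=0.9)
```

Everything above the chosen threshold can be trusted at the target precision; everything below is left for manual review rather than deleted.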
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence, and at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
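The scenario above can be illustrated with a character-level Levenshtein similarity. Commercial TM tools' exact scoring formulas are proprietary, so this plain edit-distance ratio is only an approximation of a fuzzy match score:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # A Levenshtein-based match score in [0, 1].
    return 1 - levenshtein(a, b) / max(len(a), len(b))

clause = "w1 w2 w3 w4 w5"
# The TM stores the clause inside a much longer sentence (w21 ... w35).
tm_sentence = clause + " " + " ".join(f"w{i}" for i in range(21, 36))

low = similarity(clause, tm_sentence)   # well below typical fuzzy thresholds
high = similarity(clause, clause)       # exact match once the TM is clause-split
```

Against the full stored sentence the shared clause scores far below the 70-75% default thresholds discussed later, whereas against the stored clause alone it is a 100% match.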
In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting
Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example
5 Results

WORDFAST                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)     23.33%                 90.00%
Retrieved (minimum threshold)     38.00%                 92.22%

TRADOS                            W/o clause splitting   W/ clause splitting
Retrieved (default threshold)     14.00%                 88.89%
Retrieved (minimum threshold)     14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST (mean)       Without clause splitting   With clause splitting   p-value
Default threshold     19.23                      85.30888891             0.000
Minimum threshold     29.28                      86.48888891             0.000

TRADOS (mean)         Without clause splitting   With clause splitting   p-value
Default threshold     11.78                      83.87777781             0.000
Minimum threshold     11.78                      88.07111113             0.000

SET B

WORDFAST (mean)       Without clause splitting   With clause splitting   p-value
Default threshold     2.23                       17.21111111             0.000
Minimum threshold     6.49                       30.62111112             0.000

TRADOS (mean)         Without clause splitting   With clause splitting   p-value
Default threshold     2.71                       23.083                  0.000
Minimum threshold     16.83                      46.36777779             0.000

Table 5: Paired t-test on all experiments
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   2.67%                  17.84%
Retrieved (minimum threshold)   10.00%                 41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   3.33%                  25.95%
Retrieved (minimum threshold)   36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one clause corresponds to the input segment or is a paraphrase of it.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment so that, on the target side, what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and cannot retrieve matches that are similar in meaning but different in form. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level reflects the corrections a professional translator must make for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalising the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on paraphrases in a database while matching. Their approach uses greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match is identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
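A minimal word-level version of this computation can be sketched as follows. This is one of several variants of the fuzzy match formula; the exact normalisation (such as the 13-word count used above) differs between tools and between word- and character-level scoring:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    """Word-based fuzzy match: 1 - edits / length of the longer segment."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
score = fuzzy_match(s1, s2)  # 1 substitution + 3 deletions over 10 words
```

With this normalisation the pair scores 60%; normalising over 13 words, as above, yields roughly 70%, which illustrates how sensitive the score is to the chosen denominator.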
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a direct object (e.g. to make attention) and in others with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
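The core rewriting step can be illustrated with a toy pattern-based paraphraser. The lexicon and pattern below are hypothetical stand-ins for the NooJ dictionary and local grammars described later (which also preserve tense, person and mood; this sketch handles only base forms):

```python
import re

# Hypothetical mini-lexicon: predicate noun -> full verb
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = ["make", "makes", "take", "takes", "give", "gives"]

SVC_PATTERN = re.compile(
    r"\b(?:" + "|".join(SUPPORT_VERBS) + r")\b"   # support verb
    r"\s+(?:the\s+|a\s+|an\s+)?"                  # optional determiner
    r"(" + "|".join(NOMINALIZATIONS) + r")\b"     # predicate noun
    r"(?:\s+of\b)?",                              # optional preposition
    re.IGNORECASE)

def paraphrase_svc(sentence):
    """Rewrite 'make the cancellation of X' as 'cancel X', etc."""
    return SVC_PATTERN.sub(lambda m: NOMINALIZATIONS[m.group(1).lower()],
                           sentence)
```

With this toy grammar, sentence (1) above is rewritten exactly into sentence (2), so an exact match can be found in a TM that stores the paraphrased source unit.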
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent, sentence (3), included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
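Programmatically, such an entry decomposes into a lemma, a category and a property list. The sketch below assumes the conventional comma-separated NooJ entry format shown in Figure 1:

```python
def parse_entry(entry):
    """Split a NooJ-style dictionary line, e.g. 'artist,N+FLX=TABLE+Hum',
    into lemma, category/POS and the remaining properties."""
    lemma, rest = entry.split(",", 1)
    category, *props = rest.split("+")
    return {"lemma": lemma, "category": category, "properties": props}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```

Properties such as FLX=TABLE (inflectional paradigm) and Hum (semantic class) then drive the analysis and paraphrase grammars.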
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage extracts the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last stage performs the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units. This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
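The overall expansion can be sketched as a single pass over the TM. Here `paraphrase` stands for the NooJ step described above, the translation pairs are invented examples, and every target segment is copied untouched:

```python
def expand_tm(units, paraphrase):
    """Given (source, target) pairs, append a (paraphrased-source, target)
    pair for every unit whose source contains a rewritable SVC.
    The original units, and every target, are left untouched."""
    expanded = list(units)
    for src, tgt in units:
        p = paraphrase(src)
        if p != src:                 # an SVC was found and rewritten
            expanded.append((p, tgt))
    return expanded

# Invented toy TM and paraphrase function, for illustration only
tm = [("make a reservation now", "prenotare adesso"),
      ("press the red button", "premere il pulsante rosso")]
toy = lambda s: s.replace("make a reservation", "reserve")
expanded = expand_tm(tm, toy)
```

Because only new source variants are added, the expanded TM grows exactly by the number of paraphrasable units, as in the 1,025 + 587 = 1,612 units reported in the experiments.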
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit with the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), though we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95-99% category, 6% in the 85-94% category, 28% in the 75-84% category and finally 27% in the 0-74% category (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                   14     48
95%-99%                23     32
85%-94%                18     29
75%-84%                51     38
50%-74%                32     18
No match               62     35
Total                  200    200

Table 1: Statistics for experimental data
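The category counts in Table 1 correspond to a simple banding of per-segment match scores. A sketch of how such an analysis report is tallied, using the band boundaries from the table (the score list is an invented example):

```python
from collections import Counter

def band(score):
    """Map a fuzzy match score (in percent) to the report's categories."""
    if score >= 100:
        return "100%"
    if score >= 95:
        return "95%-99%"
    if score >= 85:
        return "85%-94%"
    if score >= 75:
        return "75%-84%"
    if score >= 50:
        return "50%-74%"
    return "No match"

def analysis_report(scores):
    """Count how many segments fall into each fuzzy match band."""
    return Counter(band(s) for s in scores)

report = analysis_report([100, 97, 88, 60, 42])
```

Running the same tally over the plTM and parTM score lists yields the two columns of Table 1.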
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use a match without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.
Despite the improvements in fuzzy matching, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses, and it will be addressed in future projects.

Example 1
  Seg:      Make sure that the brake hose is not twisted
  TMsl:     Ensure that the brake hose is not twisted
  TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl:  Make sure that the brake hose is not twisted

Example 2
  Seg:      CAUTION You must make the istallation of the version 6 of the software
  TMsl:     CAUTION You must install the version 6 of the software
  TMtg:     ATTENZIONE Si deve installare la versione 6 del software
  parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs as well as support for other languages. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources, with sizes varying from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. The coverage of TMs is directly translatable into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ: být větší | EN: be greater | probabilities: 0.538 0.053 0.538 0.136 | alignment: 0-0 1-1
CZ: být větší | EN: be larger  | probabilities: 0.170 0.054 0.019 0.148 | alignment: 0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in the segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
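The entries of the kind shown in Figure 1 follow the layout of a Moses phrase table. As an illustrative sketch (the paper does not prescribe this exact file format; we assume the standard '|||'-separated Moses layout), such an entry can be parsed as follows:

```python
def parse_phrase_entry(line):
    """Split a Moses-style phrase-table line into its four fields."""
    src, tgt, scores, alignment = [f.strip() for f in line.split("|||")]
    # Four scores: inverse phrase probability, inverse lexical weighting,
    # direct phrase probability, direct lexical weighting.
    probs = [float(s) for s in scores.split()]
    # Alignment points such as "0-0 1-1": source index -> target index.
    points = [tuple(int(i) for i in p.split("-")) for p in alignment.split()]
    return src, tgt, probs, points

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, probs, points = parse_phrase_entry(entry)
```

The parsed probabilities can then be used directly for ranking alternative translations of the same subsegment.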
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

  original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
  subsegment:       "následujících kategorií" (the following categories)
  new subsegment:   "lze rozdělit do následujících kategorií"
  its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.
2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can further be increased by filtering the used subsegments on noun phrase boundaries.
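The combined n-gram scoring idea can be illustrated with a toy scorer. This is only a sketch: the actual system uses a KenLM model trained on a large corpus, whereas the stand-in below simply averages, for n = 1..5, the fraction of candidate n-grams attested in a tiny sample of target-language data.

```python
from collections import Counter

def ngram_counts(tokens_list, n_max=5):
    """Count all n-grams (n = 1..n_max) over a list of tokenized sentences."""
    counts = Counter()
    for tokens in tokens_list:
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def fluency_score(candidate, counts, n_max=5):
    """Average, over n = 1..n_max, of the fraction of candidate n-grams
    seen in the target-language data; higher means more fluent."""
    tokens = candidate.split()
    scores = []
    for n in range(1, min(n_max, len(tokens)) + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        scores.append(sum(1 for g in ngrams if g in counts) / len(ngrams))
    return sum(scores) / len(scores)

counts = ngram_counts([s.split() for s in
                       ["can be divided into the following categories"]])
```

A combination whose words appear in an unattested order receives a lower score and can be filtered out by a threshold.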
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
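Our reading of the joining step can be sketched as follows. This is a simplified, single-pass illustration, not the published implementation, and the representation of subsegments as (start, end) token spans is our own assumption:

```python
def try_join(a, b, mode="overlap"):
    """Join two (start, end) token spans: JOINO requires a proper overlap,
    JOINN requires the spans to be directly adjacent; returns None otherwise."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if mode == "overlap" and s2 < e1 and e2 > e1:   # proper overlap
        return (s1, e2)
    if mode == "neighbour" and s2 == e1:            # directly adjacent
        return (s1, e2)
    return None

def join_subsegments(spans, mode="overlap"):
    """One greedy pass, longest-first: try to join each span with every
    shorter span and collect all successful (longer) joins."""
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            j = try_join(a, b, mode)
            if j is not None:
                joined.append(j)
    return joined
```

In the full algorithm the newly produced spans would be fed back into the ordered list, so that ever longer combinations are preferred.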
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the input document can be translated using a segment from the respective TM; translations of shorter parts of the segment then count as matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

  Match   |    TM |   SUB |    OJ |    NJ |   all
  100%    |  0.41 |  0.12 |  0.10 |  0.17 |  0.46
  95–99%  |  0.84 |  0.91 |  0.64 |  0.90 |  1.37
  85–94%  |  0.07 |  0.05 |  0.25 |  0.76 |  0.81
  75–84%  |  0.80 |  0.91 |  1.71 |  3.78 |  4.40
  50–74%  |  8.16 | 10.05 | 25.09 | 40.95 | 42.58
  any     | 10.28 | 12.04 | 27.79 | 46.56 | 49.62
Table 3: MemoQ analysis, DGT-TM.

  Match   |   SUB |    OJ |    NJ |   all
  100%    |  0.08 |  0.07 |  0.11 |  0.28
  95–99%  |  0.75 |  0.44 |  0.49 |  0.66
  85–94%  |  0.05 |  0.08 |  0.49 |  0.61
  75–84%  |  0.46 |  0.96 |  3.67 |  3.85
  50–74%  | 10.24 | 27.77 | 41.90 | 44.47
  all     | 11.58 | 29.32 | 46.66 | 49.87
For comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

  Match   |   SUB |    OJ |    NJ |   all
  100%    |  0.15 |  0.13 |  0.29 |  0.57
  95–99%  |  0.98 |  0.59 |  1.24 |  1.45
  85–94%  |  0.09 |  0.22 |  1.34 |  1.37
  75–84%  |  1.03 |  2.26 |  6.35 |  7.07
  50–74%  | 12.15 | 34.84 | 49.82 | 51.62
  all     | 14.40 | 38.04 | 59.04 | 61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

company in-house translation memory
  feature  |  SUB |   OJ |   NJ |   NS
  prec     | 0.60 | 0.63 | 0.70 | 0.66
  recall   | 0.67 | 0.74 | 0.74 | 0.71
  F1       | 0.64 | 0.68 | 0.72 | 0.68
  METEOR   | 0.31 | 0.37 | 0.38 | 0.38

DGT
  prec     | 0.76 | 0.93 | 0.91 | 0.81
  recall   | 0.78 | 0.86 | 0.88 | 0.85
  F1       | 0.77 | 0.89 | 0.89 | 0.83
  METEOR   | 0.40 | 0.50 | 0.51 | 0.45
Table 6: An error example, Czech → English.

  source seg.:     Oblast dat může mít libovolný tvar
  reference:       The data area may have an arbitrary shape
  generated seg.:  Area data may have any shape
account just the segments that are pre-translated both by MemoQ Fragment assembly and by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and to measure the time needed for the translation.
6 Conclusion
We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting the retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog2, which is language-pair independent and allows users to upload their own memories into
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translators' work.
3 System Description

We demonstrate the functionalities and features of CATaLog on an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. The TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, the matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. The allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
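A sketch of how such a TER alignment can be turned into the match/edit distinction used later for color coding. The triple format below, (TM word, input word, operation), is our own simplification for illustration, not the literal tercom output format:

```python
def color_tm_match(alignment):
    """Mark each TM-match word green when the edit operation is a match
    ("C") and red otherwise; insertions have no TM-side word and are skipped."""
    colored = []
    for tm_word, _input_word, op in alignment:
        if tm_word == "":   # insertion: nothing to color on the TM side
            continue
        colored.append((tm_word, "green" if op == "C" else "red"))
    return colored

alignment = [("we", "we", "C"), ("want", "", "D"), ("to", "would", "S"),
             ("have", "like", "S"), ("a", "a", "C"), ("table", "table", "C"),
             ("near", "by", "S"), ("the", "the", "C"),
             ("window", "window", "C"), (".", ".", "C")]
colored = color_tm_match(alignment)
```

For the example above, "want", "to", "have" and "near" come out red, i.e. as the portions the post-editor would need to change.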
Since we want to rank the reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism, based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for the different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for the 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for the 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
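The argument above can be made concrete with a small sketch. The exact weights used in the system are not published; the values below are illustrative assumptions, with deletion made much cheaper than the other operations:

```python
# Per-operation edit costs (illustrative): match is free, deletion is cheap,
# insertion/substitution/shift are expensive. "H" stands for shift.
EDIT_COST = {"C": 0.0, "D": 0.25, "I": 1.0, "S": 1.0, "H": 1.0}

def weighted_edit_score(ops):
    """Sum of per-operation costs over a TER edit sequence; lower is better."""
    return sum(EDIT_COST[op] for op in ops)

# Test segment: "how much does it cost"
seg1_ops = ["C"] * 5 + ["D"] * 6   # TM segment 1: delete 6 extra words
seg2_ops = ["C"] * 2 + ["I"] * 3   # TM segment 2: insert 3 missing words
```

With equal weights, segment 2 would win (3 edits vs. 6); with the cheap deletion cost, segment 1 wins (1.5 vs. 3.0), matching the preference described in the text.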
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between the selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই।

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and the target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
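Such an alignment line can be read programmatically. The sketch below assumes the usual GIZA++ "({ ... })" notation shown above; the parser itself is a hypothetical helper for illustration, not part of CATaLog:

```python
import re

def parse_giza_line(line):
    """Read a GIZA++-style alignment line into (source_word, [target
    positions]) pairs; an empty position list means an unaligned word."""
    pairs = re.findall(r"(\S+)\s+\(\{([\d\s]*)\}\)", line)
    return [(word, [int(i) for i in idxs.split()]) for word, idxs in pairs]

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) "
        "a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
alignment = parse_giza_line(line)
```

The resulting pairs give, for each English word, the positions of the Bengali words it is aligned to, which is exactly what the color-coding step needs.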
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars
2. i think you've got the wrong number
3. you are wrong
4. you pay me
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the 5 topmost color-coded TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments against all the reference sentences (i.e., the source TM segments) takes a long time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface forms: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch the use of these posting lists on or off, so that results can be compared with and without them. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
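The posting-list filtering described above can be sketched as follows; the stop-word list and whitespace tokenization are illustrative assumptions, as the paper does not specify them.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is", "are"}  # illustrative list

def build_posting_lists(tm_sources):
    """Map each lowercased, non-stop-word surface form to the ids of the
    TM source sentences containing it (no stemming, per the paper)."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for token in sentence.lower().split():
            if token not in STOP_WORDS:
                postings[token].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Union of the posting lists of the input's vocabulary words: only
    these TM sentences are then scored with TER, shrinking the search."""
    candidates = set()
    for token in input_sentence.lower().split():
        candidates |= postings.get(token, set())
    return candidates

tm = ["you gave me the wrong change",
      "i think you have got the wrong number",
      "see you tomorrow"]
postings = build_posting_lists(tm)
cands = candidate_sentences(postings, "you gave me wrong number")
```

Building the index is done once, offline; at query time, only the candidate set is passed to the expensive TER computation.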
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
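A minimal sketch of such a batch mode, assuming a hypothetical `tm_lookup` function that returns the top TM matches for one sentence; the file and log formats here are illustrative, not CATaLog's actual ones.

```python
def batch_translate(input_path, log_path, tm_lookup):
    """Run TM retrieval over every sentence of an input file and write the
    results to a log file that post-editors can later use offline.
    `tm_lookup` is assumed to map one sentence to a list of TM matches."""
    with open(input_path, encoding="utf-8") as fin, \
         open(log_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sentence = line.strip()
            if not sentence:
                continue  # skip blank lines
            matches = tm_lookup(sentence)
            fout.write(sentence + "\t" + " ||| ".join(matches) + "\n")
```

The log pairs each input sentence with its retrieved matches, so the post-editing can happen entirely outside the tool.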
4 Conclusions and Future Work
This paper presents ongoing research on, and development of, a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining several features in the tool, for example sentence and clause segmentation, using the comma and the semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set. (The x-axis shows the Church-Gale score; the y-axis the probability.)
The Church-Gale score varies in the interval [−131.5, 160.15].
To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile (Q-Q) plots, shown in Figures 5 and 6.
In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments whose scores are far from the center of the distribution. In this way we hope to draw enough negative examples. After manually validating the examples, we created a training set and a test set distributed as follows:
• Training Set: contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
Figure 5: The Q-Q plot for the Matecat set.

Figure 6: The Q-Q plot for the Collaborative set.

(Both plots show sample quantiles against theoretical quantiles.)
4 Machine Learning
In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, where the source segment is copied into the target segment. Of course, there are perfectly legitimate cases where the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases where a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
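Equation 1 is not reproduced in this excerpt; the standard length statistic from Gale and Church (1993), on which the feature is based, can be sketched as follows. The constants and exact form used in this paper's equation 1 may differ.

```python
import math

def church_gale_score(len_src, len_tgt, c=1.0, s2=6.8):
    """Normalized difference between the target length and the expected
    target length, as in Gale and Church (1993): delta = (l2 - c*l1) /
    sqrt(l1 * s2). c (target chars per source char) and s2 (variance per
    source char) are the values suggested in the original paper."""
    return (len_tgt - c * len_src) / math.sqrt(len_src * s2)
```

Equal-length segments score 0; a target much longer or shorter than expected yields a large positive or negative score, which is exactly the tail of the distribution the sampling above targets.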
• has url: The value of the feature is 1 if the source or target segment contains a URL address; otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.
• has capital letters: The value of the feature is 1 if the source or target segment contains words with at least one capital letter; otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there are whole words in capital letters.
• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists it is present/not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL-address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capital-letter words in the bi-segment. It is complementary to has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of only-capital-letter words in the bi-segment. It is complementary to has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
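The cosine-similarity features above can be illustrated with the punctuation vectors; the sketch below is our own minimal rendering, in which a "punctuation vector" is simply a count of each punctuation mark in a segment (the paper does not spell out the exact vectorization).

```python
import math
import string
from collections import Counter

def punctuation_vector(segment):
    """Counts of each punctuation mark occurring in a segment."""
    return Counter(ch for ch in segment if ch in string.punctuation)

def cosine_similarity(va, vb):
    """Cosine similarity of two sparse count vectors; 0 if either is empty."""
    dot = sum(va[k] * vb[k] for k in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

src = "Hello, world!"
tgt = "Ciao, mondo!"
sim = cosine_similarity(punctuation_vector(src), punctuation_vector(tgt))
# identical punctuation profiles give similarity 1.0
```

The tag, email, URL, and number similarities follow the same pattern with a different extraction step in place of `punctuation_vector`.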
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5
4.2 Algorithms

As we showed in section 3, there are cases where the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments whenever this situation is detected. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers consisting of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
bull Logistic Regression The Logistic Regres-sion works particularly well when the fea-tures are linearly separable In addition theclassifier is robust to noise avoids overfittingand its output can be interpreted as probabil-ity scores
bull Support Vector Machines with the linearkernel Support Vector Machines are one ofthe most used classification algorithms
• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances; the predicted output is the majority label among them. Because it is a non-parametric method, it can
5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
give good results in classification problemswhere the decision boundary is irregular
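The six classifiers above are all available in scikit-learn, which the paper uses. A minimal sketch of running the comparison follows; the toy feature matrix and all settings here are illustrative assumptions, not the paper's actual data or configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Toy stand-in for the real feature matrix: values in [0, 1], binary labels.
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Three-fold stratified evaluation with F1, mirroring the setup in section 5.
results = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    results[name] = scores.mean()
```

`cross_val_score` with an integer `cv` uses stratified folds for classifiers, matching the "three-fold stratified" protocol described in the next section.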
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms achieve excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed on the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the corresponding difference in the first evaluation. This might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets; obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.
Unlike the first evaluation, the second has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where precision is increased, even if recall drops. We make some suggestions in the next section.
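The error analysis above can be checked directly from the reported confusion counts. Note that per-class figures derived from these raw counts need not coincide exactly with the (possibly averaged) values in Table 2.

```python
# Confusion counts reported for the SVM on the 309-example test set,
# taking "true translation" as the positive class.
tp, fp, fn, tn = 175, 42, 32, 60
assert tp + fp + fn + tn == 309

precision = tp / (tp + fp)            # ~ 0.81 for the positive class
recall = tp / (tp + fn)               # ~ 0.85 for the positive class
false_negative_share = fn / 309       # ~ 0.10: good bi-segments discarded
```

The last quantity is the ~10% of good bi-segments that automatic cleaning would delete, which is the cost that motivates trading recall for precision.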
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the
performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on the subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that these probability scores can be ranked, and that higher scores correlate with the confidence that a bi-segment is positive or negative.
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with a moderate number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems, in order to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM, but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types, such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore, clause matches are more likely to be in context, and to actually be used by the translator, than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called "second generation" TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, in order to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke (2013)).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the "third-generation translation memory", which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
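The intuition can be illustrated with a toy score. Here difflib's sequence ratio serves only as a rough stand-in for a TM fuzzy-match score; the actual tools use Levenshtein-based measures whose details are not disclosed:

```python
import difflib

def fuzzy(a, b):
    # similarity of two word sequences, scaled to [0, 1]
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

# (a): w1 ... w20; (b) shares only its first five words with (a)
a = " ".join(f"w{i}" for i in range(1, 21))
b = " ".join(f"w{i}" for i in [1, 2, 3, 4, 5] + list(range(21, 36)))

print(fuzzy(a, b))   # 0.25 -- far below a typical 70-75% threshold

# with clause splitting, the shared clause is stored as its own TM
# segment, so the input clause scores a perfect match against it
print(fuzzy("w1 w2 w3 w4 w5", "w1 w2 w3 w4 w5"))   # 1.0
```

Against the whole sentence (b), the shared five-word clause yields a similarity of only 0.25, while against a clause-level segment the same material is an exact match.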
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is, and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004).
The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well the tools would perform if clause splitting is used in pre-processing.
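As a sketch, this pre-processing amounts to replacing every segment with its component clauses before the TM import. The regex below is a crude, hypothetical stand-in for Puscasu's rule-based splitter, handling only explicit that-complementizers; the real module identifies finite clauses with full linguistic rules:

```python
import re

def split_clauses(segment):
    # crude stand-in for a rule-based clause splitter: break before
    # "that" / "and that" complementizers (not between "and" and "that")
    parts = re.split(r"(?<!\band)\s+(?=that\b)|\s+(?=and\s+that\b)", segment)
    return [p.strip() for p in parts if p.strip()]

def preprocess(segments):
    # one clause per TM segment, as in the experimental setup
    return [clause for seg in segments for clause in split_clauses(seg)]

tm = ["the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war and that the ministry of defense "
      "believes there may still be some survivors"]
for clause in preprocess(tm):
    print(clause)
```

Note that the actual splitter also detects the finite complement clause "there may still be some survivors" (see Table 1), which this toy regex cannot.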
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is, and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata 2008; cited in Macken 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue that it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of them contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting
Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                90.00%
Retrieved (minimum threshold)          38.00%                92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                88.89%
Retrieved (minimum threshold)          14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is
still a noticeable increase in the percentage of matches retrieved.

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                17.84%
Retrieved (minimum threshold)          10.00%                41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                25.95%
Retrieved (minimum threshold)          36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all although it has more than one clause, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat it as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.

SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           19.23                       85.31               0.000
Minimum threshold           29.28                       86.49               0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           11.78                       83.88               0.000
Minimum threshold           11.78                       88.07               0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.23                       17.21               0.000
Minimum threshold            6.49                       30.62               0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.71                       23.08               0.000
Minimum threshold           16.83                       46.37               0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are semantically similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, together with the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections a professional translator makes in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy match level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows. Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al. 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark 2009; Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
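Since the commercial tools do not disclose their algorithms, the following is only an illustrative sketch of Levenshtein-based fuzzy matching in the spirit of Koehn's formula, normalising edit distance by the length of the longer segment. The exact scores depend on tokenisation and whitespace conventions, so the values below differ slightly from the 70% and 76% quoted above:

```python
def levenshtein(a, b):
    # one-row dynamic-programming edit distance over any two sequences
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                      # deletion
                         row[j - 1] + 1,                  # insertion
                         prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return row[n]

def fuzzy_match(s1, s2, unit="word"):
    # 1 - distance / longer-length, on words or on characters sans spaces
    x = s1.split() if unit == "word" else s1.replace(" ", "")
    y = s2.split() if unit == "word" else s2.replace(" ", "")
    return 1 - levenshtein(x, y) / max(len(x), len(y))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2, "word"), 2))  # 0.6  (1 sub + 3 del, 10 words)
print(round(fuzzy_match(s1, s2, "char"), 2))  # 0.74 (15 del, 57 characters)
```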
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and the output would then be post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear with a bare predicate noun (e.g. to make attention), in others with a determined direct object (e.g. to make a reservation) (Athayde 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrase of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The
module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
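The entry format in Figure 1 — a lemma followed by a category and "+"-separated properties, some carrying values — can be read mechanically. The following is a minimal, hypothetical parser for such lines (not part of NooJ itself):

```python
def parse_entry(line):
    # "artist,N+FLX=TABLE+Hum" -> lemma, category, {property: value}
    lemma, rest = line.split(",", 1)
    category, *features = rest.split("+")
    props = {}
    for feat in features:
        key, sep, value = feat.partition("=")
        props[key] = value if sep else True   # valueless flags map to True
    return lemma, category, props

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# ('artist', 'N', {'FLX': 'TABLE', 'Hum': True})
```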
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al. 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
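The grammar's behaviour on the running example can be imitated with a toy rewrite. The support-verb list and the nominalization-to-verb lexicon below are hypothetical stand-ins for NooJ's dictionaries and local grammars, and unlike the real module this sketch ignores tense, person and agreement:

```python
import re

# hypothetical lexicon: predicate noun -> corresponding full verb
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}

# support verb + optional determiner + predicate noun + optional "of"
SVC = re.compile(r"\b(?:make|makes|take|takes|give|gives|have|has)"
                 r"\s+(?:the|a|an)?\s*(\w+)(?:\s+of)?", re.IGNORECASE)

def paraphrase_svc(sentence):
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)   # leave non-SVCs untouched
    return SVC.sub(repl, sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

Verb-noun sequences whose noun has no entry in the lexicon are left unchanged, mirroring the fact that the real grammar only fires when its lexical constraints are satisfied.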
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.

In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%). However, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
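As a rough illustration of why a paraphrased unit raises the best match score, a fuzzy match can be sketched as a normalized word-overlap similarity. This is only a stand-in: commercial CAT tools use their own (unpublished) edit-distance based formulas. The segments below are the brake-hose example from Figure 4.

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b):
    # Word-level similarity in percent; an approximation of the
    # edit-distance based scores used by commercial CAT tools.
    return round(100 * SequenceMatcher(None, a.split(), b.split()).ratio())

new_sentence = "Make sure that the brake hose is not twisted"
plain_tm = ["Ensure that the brake hose is not twisted"]
# parTM additionally holds the SVC paraphrase of the same unit
par_tm = plain_tm + ["Make sure that the brake hose is not twisted"]

best_plain = max(fuzzy_match(new_sentence, u) for u in plain_tm)
best_par = max(fuzzy_match(new_sentence, u) for u in par_tm)
```

With the paraphrase present, the new sentence finds an exact (100%) match instead of a high-but-partial one.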
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in case the match is not equal to 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or overly complex syntax. This is one of our approach's weaknesses that will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the installation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the installation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.
The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is high compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even more.
References
Athayde M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici E. and Dymetman M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong M., Cheng Y., Liu Y., Xu J., Sun M., Izuha T. and Hao J. 2014. Query lattice for translation retrieval. In COLING.

Gupta R. and Orasan C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He Y., Ma Y., van Genabith J. and Way A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He Y., Ma Y., Way A. and van Genabith J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász G. and Pohl G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn P. and Senellart J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar V. and Mitkov R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov S., Barrett L., Thibaux R. and Klein D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas E. and Furuse O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott B. and Barreiro A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith J. and Clark S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang K., Zong C. and Su K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev V. and van Genabith J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs directly translates to savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
The evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities                  alignment
být větší   be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148       0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describe a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
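These three cases can be read directly off the alignment points. A minimal sketch (our own illustration; the field names and classification logic are not taken from the authors' implementation):

```python
def alignment_properties(src, tgt, alignment):
    # Parse Moses-style alignment points, e.g. "0-0 1-1", for one pair.
    src_words, tgt_words = src.split(), tgt.split()
    points = [tuple(map(int, p.split("-"))) for p in alignment.split()]
    aligned_src = {s for s, _ in points}
    aligned_tgt = {t for _, t in points}
    return {
        # 1) empty alignment: some word has no counterpart
        "empty": len(aligned_src) < len(src_words)
                 or len(aligned_tgt) < len(tgt_words),
        # 2) one-to-many: one word aligned to several words on the other side
        "one_to_many": len(points) > len(aligned_src)
                       or len(points) > len(aligned_tgt),
        # 3) opposite orientation: target indices decrease somewhere
        "reordered": [t for _, t in sorted(points)]
                     != sorted(t for _, t in points),
    }

# The Figure 1 pair: both words aligned monotonically
props = alignment_properties("být větší", "be greater", "0-0 1-1")
```

For the Figure 1 pair all three flags are False; a crossing alignment such as "0-1 1-0" would set the reordering flag.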
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.²
The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
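The combined n-gram fluency score can be sketched with plain corpus counts and add-one smoothing; this is only a toy substitute for the KenLM model the authors actually use, and the corpus and candidates below are invented:

```python
from collections import Counter
from math import log

def ngram_fluency(tokens, corpus_tokens, max_n=5):
    # Average log-probability of the candidate's n-grams (n = 1..max_n)
    # under simple corpus counts with add-one smoothing; higher = more fluent.
    score, total = 0.0, 0
    for n in range(1, max_n + 1):
        counts = Counter(tuple(corpus_tokens[i:i + n])
                         for i in range(len(corpus_tokens) - n + 1))
        denom = sum(counts.values()) + len(counts) + 1
        for i in range(len(tokens) - n + 1):
            score += log((counts[tuple(tokens[i:i + n])] + 1) / denom)
            total += 1
    return score / max(total, 1)

# Toy monolingual "corpus" and two candidate combinations
corpus = "make sure that the brake hose is not twisted".split()
fluent = ngram_fluency("the brake hose".split(), corpus)
scrambled = ngram_fluency("hose the brake".split(), corpus)
```

A combination whose word order matches the target-language data scores higher, which is exactly the signal used to rank non-overlapping combinations.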
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
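Our reading of this procedure, with subsegments represented as (start, end) token spans in one segment, can be sketched as follows. This is an interpretation, not the authors' code; a join is accepted only when the two spans neighbour or overlap, matching the JOINN/JOINO conditions:

```python
def join_subsegments(subsegments):
    # Greedy JOIN sketch over (start, end) token spans of one segment.
    # List I is ordered by span length (descending); newly joined spans
    # are prepended so that longer joins are tried first.
    I = sorted(set(subsegments), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    result = []
    while I:
        best = I.pop(0)
        T = []
        for other in I:
            s, e = min(best[0], other[0]), max(best[1], other[1])
            covered = (best[1] - best[0]) + (other[1] - other[0])
            joined = (s, e)
            # join only spans that neighbour or overlap, each join once
            if covered >= e - s and joined not in seen:
                seen.add(joined)
                T.append(joined)
        if T:
            I = T + I            # prepend new, longer subsegments
        else:
            result.append(best)  # nothing joinable: keep and move on
    return result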
2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match      TM      SUB     OJ      NJ      all
100%       0.41    0.12    0.10    0.17    0.46
95-99%     0.84    0.91    0.64    0.90    1.37
85-94%     0.07    0.05    0.25    0.76    0.81
75-84%     0.80    0.91    1.71    3.78    4.40
50-74%     8.16   10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match      SUB     OJ      NJ      all
100%       0.08    0.07    0.11    0.28
95-99%     0.75    0.44    0.49    0.66
85-94%     0.05    0.08    0.49    0.61
75-84%     0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match      SUB     OJ      NJ      all
100%       0.15    0.13    0.29    0.57
95-99%     0.98    0.59    1.24    1.45
85-94%     0.09    0.22    1.34    1.37
75-84%     1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translator productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
³ The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴ Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.
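The edit-rate computation can be sketched as follows. This is a simplified, hypothetical implementation: real TER, as computed by tercom, also searches for block shifts, which we omit here for brevity.

```python
def edit_rate(hyp, ref):
    """Word-level edit rate: the minimum number of insertions, deletions
    and substitutions turning hyp into ref, divided by len(ref).
    (Real TER additionally allows block shifts; omitted here.)"""
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if hw == rw else 1
            cur.append(min(prev[j] + 1,          # delete a word of hyp
                           cur[j - 1] + 1,       # insert a word of ref
                           prev[j - 1] + cost))  # substitute / match
        prev = cur
    return prev[-1] / len(r)

tm = "we want to have a table near the window"
inp = "we would like a table by the window"
print(edit_rate(tm, inp))  # 1 deletion + 3 substitutions over 8 words -> 0.5
```

On the example from the next paragraph, the TM match needs one deletion and three substitutions to become the input, giving an edit rate of 4/8 = 0.5.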
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
⁵ http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)
we want to    have a table near the window .
|  D    S     S    |   |    S   |   |      |
we -    would like a table by   the window .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost
• TM segment 1: how much does it cost to the holiday inn by taxi
• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
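This weighted scoring can be sketched as follows. The weight values are illustrative; the paper only states that deletions receive a very low weight and the other operations equal weights.

```python
# Illustrative edit-operation weights (not the paper's actual values):
# deletion is cheap, the other operations cost the same.
WEIGHTS = {"C": 0.0, "I": 1.0, "S": 1.0, "H": 1.0, "D": 0.1}

def score(ops, weights=WEIGHTS):
    """Total post-editing cost of a TER alignment given as a string of
    ops: C (match), I (insert), D (delete), S (substitute), H (shift)."""
    return sum(weights[op] for op in ops)

seg1_ops = "CCCCCDDDDDD"   # TM segment 1: 5 matches, 6 deletions
seg2_ops = "CCIII"         # TM segment 2: 2 matches, 3 insertions
print(round(score(seg1_ops), 2), score(seg2_ops))  # 0.6 vs 3.0 -> segment 1 wins
print(score(seg1_ops, dict(WEIGHTS, D=1.0)))       # uniform weights: 6.0 -> segment 2 would win
```

With uniform weights the shorter segment 2 wins (cost 3 vs 6); with a low deletion weight segment 1 wins (cost 0.6 vs 3), matching the argument above.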
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )
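A small sketch of how such an alignment line can be parsed into token-to-position mappings. The helper is hypothetical, and real GIZA++ A3 files write the positions as `({ i j })`; the plainer bracket format shown above is what we parse here.

```python
import re

def parse_alignment(line):
    """Parse an alignment line of the form 'tok ( i j ) tok ( ) ...'
    into (token, [target positions]) pairs. Hypothetical helper; real
    GIZA++ A3 files use the '({ i j })' notation."""
    pairs = re.findall(r"(\S+)\s*\(\s*([\d\s]*)\)", line)
    return [(tok, [int(i) for i in idx.split()]) for tok, idx in pairs]

line = ("NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) "
        "a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 )")
for tok, positions in parse_alignment(line):
    print(tok, positions)   # e.g. 'we [1]', 'to []' (unaligned)
```

Tokens with an empty bracket ("to", "have", "the") are unaligned in the Bengali translation, which matters when propagating the TER match/mismatch colors to the target side.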
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentencesfor similarity measurement which contain one
or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
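This filtering can be sketched as follows. The helper names and the stop-word list are illustrative, not the system's actual implementation.

```python
from collections import defaultdict

# Illustrative stop-word list; the real system removes frequent,
# low-importance tokens from the vocabulary.
STOP = {"a", "an", "the", "to", "of", "is", "do", "does", "it"}

def build_index(tm_sources):
    """Map each lowercased, non-stop surface word to the ids of the TM
    source sentences containing it -- a posting list per word."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:
                index[w].add(sid)
    return index

def candidates(index, query):
    """TM sentences sharing at least one vocabulary word with the query;
    only these are passed on to the (expensive) TER computation."""
    ids = set()
    for w in query.lower().split():
        ids |= index.get(w, set())
    return ids

tm = ["how much does it cost to the holiday inn by taxi",
      "we want to have a table near the window",
      "you pay me"]
idx = build_index(tm)
print(sorted(candidates(idx, "how much does it cost")))  # only sentence 0 survives
```

Only the surviving candidates are scored with TER, which is what reduces the lookup time on a large TM database.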
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
4 Machine Learning
In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.
4.1 Features

The features computed for the training and test sets are the following:
• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.
• cg score. This feature is the Church-Gale score described in equation 1. The score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.
• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.
• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.
• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.
• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.
• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when whole words in capital letters exist.
• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. a tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).
• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with the feature has email to exhaust all possibilities.
• url similarity. The value of the feature is the cosine similarity between the URL vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.
• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.
• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
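Two of these features can be sketched as follows. The function names are ours, mirroring the feature names above, and the real system additionally normalizes all feature values to [0, 1] (so the raw lang dif count would still be scaled).

```python
import math
import string
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    num = sum(u[k] * v[k] for k in u)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def punctuation_similarity(src, tgt):
    """Cosine between the punctuation-count vectors of the two segments."""
    counts = lambda s: Counter(ch for ch in s if ch in string.punctuation)
    return cosine(counts(src), counts(tgt))

def lang_dif(declared, detected):
    """How many of the two declared language codes disagree with the
    codes reported by a language detector: 0, 1 or 2."""
    return sum(d != e for d, e in zip(declared, detected))

print(punctuation_similarity("Hello, world!", "Ciao, mondo!"))  # identical punctuation
print(lang_dif(("en", "it"), ("en", "fr")))                     # one mismatch -> 1
```

The same cosine helper applies unchanged to the tag, email, URL and number vectors described above; only the extraction step differs.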
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment into the target language lacks punctuation.
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu⁵.
4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments when this situation is detected. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):
• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules they infer are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.
• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.
• Logistic Regression. Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.
• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.
• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
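With scikit-learn, the classifiers above (plus the two dummy baselines used in the next section) can be instantiated and evaluated roughly as follows. This is a sketch on synthetic data using current scikit-learn class names, not the authors' exact 2011-era setup or feature matrix.

```python
# Sketch: the six classifiers above plus the two baselines of Section 5,
# scored with three-fold stratified cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the real feature matrix (the actual system uses the
# hand-crafted features of Section 4.1).
X, y = make_classification(n_samples=600, n_features=16, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}

cv = StratifiedKFold(n_splits=3)
scores = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
          for name, clf in classifiers.items()}
for name, f1 in scores.items():
    print(f"{name:22s} F1 = {f1:.2f}")
```

The `StratifiedKFold`/`DummyClassifier` combination mirrors the evaluation protocol of Section 5: three stratified folds, with a uniform-random and a class-distribution-respecting baseline.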
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71
Table 1: The results of the three-fold stratified classification.
Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59
Table 2: The results of the classification on the test set.
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods, where the precision is increased even if the recall drops. We make some suggestions in the next section.
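The roughly 10% figure follows directly from the confusion counts reported above:

```python
# Confusion counts for the SVM on the test set, as reported in the text.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn          # 309 examples in total
print(f"false negatives: {fn / total:.1%} of the test set")  # -> 10.4%
```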
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the
performance in finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
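The probability-ranking idea can be sketched with scikit-learn's `predict_proba`. Synthetic data stands in for the real feature vectors, and the threshold values are illustrative.

```python
# Sketch of precision-oriented filtering: keep only bi-segments whose
# predicted probability of being a true translation clears a threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=16, flip_y=0.1,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]   # P(positive) per test bi-segment

for t in (0.5, 0.9):
    keep = proba >= t
    precision = yte[keep].mean()       # all kept items are predicted positive
    print(f"threshold {t}: keep {keep.mean():.0%}, precision {precision:.2f}")
```

Raising the threshold trades coverage (fewer bi-segments kept) for precision, which is exactly the regime the cleaning task needs: deleting only what the classifier is confident about.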
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
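A deliberately small sketch of such a dictionary-based module, with an invented EN-IT word map standing in for a dictionary built by an aligner (the entries are illustrative, not MyMemory data):

```python
# Toy indexed bilingual dictionary: a hash map gives O(1) per-token
# lookup, which is what makes this approach faster than calling a full
# machine translation system. Entries below are invented examples.

BILINGUAL_DICT = {
    "press": "premere",
    "cancel": "annullare",
    "information": "informazioni",
}

def gloss(segment: str) -> list[str]:
    """Look up each token; unknown tokens are passed through unchanged."""
    return [BILINGUAL_DICT.get(tok.lower(), tok) for tok in segment.split()]

print(gloss("Press cancel"))
```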
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of a match being found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context, and to actually be used by the translator, than phrase matches, for example.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases. For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian and, to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).
In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform when clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm of most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:

  Input:          the ministry of defense once indicated that about 20000
                  soldiers were missing in the korean war
  Segment in TM:  the ministry of defense once indicated that about 20000
                  soldiers were missing in the korean war and that the
                  ministry of defense believes there may still be some
                  survivors

With clause splitting:

  Input:           - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
  Segments in TM:  - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
                   - and that the ministry of defense believes
                   - there may still be some survivors

Table 1: Set A example

Without clause splitting:

  Input:          a member of the chart-topping collective so solid crew
                  dumped a loaded pistol in an alleyway
  Segment in TM:  a member of the rap group so solid crew threw away a
                  loaded gun during a police chase southwark crown court
                  was told yesterday

With clause splitting:

  Input:           a member of the chart-topping collective so solid crew
                   dumped a loaded pistol in an alleyway
  Segments in TM:  - a member of the rap group so solid crew threw away a
                     loaded gun during a police chase
                   - southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33                 90.00
Retrieved (minimum threshold)          38.00                 92.22

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00                 88.89
Retrieved (minimum threshold)          14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67                 17.84
Retrieved (minimum threshold)          10.00                 41.08

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33                 25.95
Retrieved (minimum threshold)          36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as a segment not being split at all when it has more than one clause, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           19.23                     85.30888891           0.000
Minimum threshold           29.28                     86.48888891           0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           11.78                     83.87777781           0.000
Minimum threshold           11.78                     88.07111113           0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.23                     17.21111111           0.000
Minimum threshold            6.49                     30.62111112           0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.71                     23.083                0.000
Minimum threshold           16.83                     46.36777779           0.000

Table 5: Paired t-test on all experiments
6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting on the source side only. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitl@auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is speeded up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalising the match percentage as well in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.
The rest of the paper is organized as follows: section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output and not on improving fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based fuzzy match between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based fuzzy match is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains a support verb construction (SVC), while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptions, they appear either without a determiner (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
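To make the idea concrete, the paraphrasing step can be imitated with a toy lexicon that maps predicate nouns to their full verbs. This is only an illustrative sketch: the actual system relies on NooJ's large dictionaries and local grammars and handles inflection; the lexicon entries and pattern below are our own assumptions:

```python
import re

# Toy predicate-noun -> full-verb lexicon (illustrative entries only;
# the real NooJ dictionaries are far larger and carry inflection paradigms)
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "reservation": "reserve",
    "installation": "install",
}

SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"

# support verb + optional determiner + predicate noun + optional "of"
SVC_PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS + r"\s+(?:(?:the|a|an)\s+)?(\w+)(\s+of)?\b",
    re.IGNORECASE)

def paraphrase_svcs(sentence):
    """Replace recognized SVCs with the corresponding full verb;
    leave everything else (including unknown nouns) untouched."""
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)
    return SVC_PATTERN.sub(repl, sentence)
```

Applied to sentence (1), this sketch yields sentence (2), which is exactly the kind of paraphrased source unit added to the expanded TM.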
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The
module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al. 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation and displayed as a full verb, while the other elements of the construction are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                   14     48
95-99%                 23     32
85-94%                 18     29
75-84%                 51     38
50-74%                 32     18
No Match               62     35
Total                  200    200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required when the match is not equal to 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or due to their syntactic complexity. This is a weakness of our approach that will be addressed in future work.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the fuzzy match level as much as possible in cases where the meaning is the same but the lexical items differ.
The implemented strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy-match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.
In the future we will explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, applying the paraphrase framework to the target side may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331-339.
Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities            alignment
být větší  be greater  0.538 0.053 0.538 0.136  0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148  0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
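For illustration, entries like those in Figure 1 (in a Moses-style phrase-table line format) can be parsed and ranked as sketched below. The paper does not state which of the four scores drives the selection, so using the direct phrase translation probability here is our assumption:

```python
def parse_subsegment(line):
    """Parse a Moses-style phrase-table line:
    'src ||| tgt ||| p1 p2 p3 p4 ||| alignment'."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    return {
        "src": src,
        "tgt": tgt,
        # inverse phrase prob., inverse lexical weight,
        # direct phrase prob., direct lexical weight
        "probs": tuple(float(p) for p in probs.split()),
        # "0-0 1-1" -> [(0, 0), (1, 1)]
        "align": [tuple(map(int, pt.split("-"))) for pt in align.split()],
    }

def best_translation(entries):
    """Pick the candidate with the highest direct phrase translation
    probability (third score, by the Moses convention); an assumed
    selection criterion for this sketch."""
    return max(entries, key=lambda e: e["probs"][2])
```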
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; the output is NS.
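The gap-filling idea of SUBSTITUTE-N can be sketched as follows. The gap representation (a `None` placeholder in both the source and target token lists) is our own simplification for illustration; the actual system works with token-index spans over the TM segments:

```python
def substitute_n(outer, gap_filler):
    """Sketch of SUBSTITUTE-N: 'outer' is a known segment pair whose
    source and target each contain a one-slot gap marked with None;
    'gap_filler' is a subsegment pair inserted into that gap on both
    sides.  The representation is assumed, for illustration only."""
    src, tgt = outer
    f_src, f_tgt = gap_filler

    def fill(tokens, sub):
        # Expand the None placeholder into the filler's tokens
        return [t for tok in tokens
                for t in (sub if tok is None else [tok])]

    return fill(src, f_src), fill(tgt, f_tgt)
```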
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = ⟨1, 5⟩) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
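A greatly simplified sketch of the JOIN-N idea follows: subsegments are represented as (start, end) token spans with their translations, and spans that are direct neighbours in the segment are concatenated, longest-first as in the algorithm above. The language-model fluency filtering is omitted, and concatenating the translations in source order is exactly the weakness discussed later in the error analysis:

```python
def join_neighbours(subsegments):
    """Sketch of JOIN-N: subsegments are (start, end, translation)
    token spans within one tokenized segment; spans that are direct
    neighbours in the segment are merged into longer fragments.
    Note: concatenating translations assumes the target language
    keeps the source order, which is not always true."""
    # longest subsegments first, as in the described algorithm
    pool = sorted(subsegments, key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    for i, (s1, e1, t1) in enumerate(pool):
        for s2, e2, t2 in pool[i + 1:]:
            if e1 == s2:        # second span starts where first ends
                joined.append((s1, e2, t1 + " " + t2))
            elif e2 == s1:      # second span ends where first starts
                joined.append((s2, e1, t2 + " " + t1))
    return joined
```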
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have run and evaluated the methods also on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).
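The banding used in the coverage tables can be reproduced with a small helper that maps a percent score to its MemoQ-style category (a sketch; the band edges follow the tables below):

```python
def match_category(score):
    """Map a fuzzy match score (in percent) to the MemoQ-style
    band used in the coverage tables."""
    if score >= 100:
        return "100%"
    if score >= 95:
        return "95-99%"
    if score >= 85:
        return "85-94%"
    if score >= 75:
        return "75-84%"
    if score >= 50:
        return "50-74%"
    return "No match"

def coverage_report(scores):
    """Count how many segments fall into each band."""
    report = {}
    for s in scores:
        band = match_category(s)
        report[band] = report.get(band, 0) + 1
    return report
```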
Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into ac-
Table 5 METEOR 100 matchescompany in-house translation memory
feature SUB OJ NJ NSprec 060 063 070 066
recall 067 074 074 071F1 064 068 072 068
METEOR 031 037 038 038DGT
prec 076 093 091 081recall 078 086 088 085
F1 077 089 089 083METEOR 040 050 051 045
Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
count just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and to measure the time needed for the translation.
6 Conclusion
We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering out of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information see the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

TM:    we  want  to     have  a  table  near  the  window
       |   D     S      S     |  |      S     |    |
Input: we  -     would  like  a  table  by    the  window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
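The effect of the weighting can be sketched as follows. The exact weights are our assumption: the paper only states that deletions get a very low weight and the other operations equal weights.

```python
# Hypothetical per-operation weights: C = match, D = delete,
# I = insert, S = substitute. 0.1 vs 1.0 is an illustrative choice.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(ops):
    """ops is a string of per-word edit labels: C, D, I, or S."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
tm1 = weighted_cost("CCCCC" + "D" * 6)  # TM 1: 5 matches, 6 deletions
tm2 = weighted_cost("CC" + "I" * 3)     # TM 2: 2 matches, 3 insertions
assert tm1 < tm2  # the cheap deletions make TM segment 1 preferred
```

With uniform weights TM segment 2 would win (3 edits vs 6); with a low deletion weight TM segment 1 wins, as the text argues.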
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
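Combining the two alignment sets can be sketched as follows. The function and data layout are our illustration, not the tool's actual code: a target word is colored green when it is aligned to a source word that TER marked as matched.

```python
def color_target(target_len, src_tgt_align, matched_src):
    """Color each target word green if it is aligned to a TER-matched
    source word, red otherwise (unaligned words default to red)."""
    colors = ["red"] * target_len
    for s, t in src_tgt_align:  # 0-based (source, target) index pairs
        if s in matched_src:
            colors[t] = "green"
    return colors

# TM source: "we want to have a table near the window ."
# TER matches (C) against the input: we, a, table, the, window, .
matched = {0, 4, 5, 7, 8, 9}
# word alignment to the 7-token Bengali translation, 0-based,
# converted from the 1-based GIZA++ example above
align = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]
print(color_target(7, align, matched))
```

Target positions aligned to "want" and "near" come out red, mirroring the mismatched words in the source.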
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me
5. you 're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the above-mentioned color coded top 5 TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists in order to compare results with and without posting lists. In an ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
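The posting-list filter described above can be sketched as follows. The stop-word list is illustrative; the actual system derives its list from the TM data.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "to", "of", "is"}  # illustrative stop list

def build_postings(tm_sources):
    """Map each lowercased, non-stop surface word to the set of
    TM sentence ids containing it; no stemming is performed."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                postings[word].add(sid)
    return postings

def candidates(postings, input_sentence):
    """Only TM sentences sharing at least one vocabulary word with
    the input sentence are later scored with TER."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["how much does it cost to the holiday inn by taxi",
      "we want to have a table near the window",
      "you pay me"]
postings = build_postings(tm)
print(candidates(postings, "how much does it cost"))  # only sentence 0
```

Only the surviving candidates are passed to the (expensive) TER alignment step, which is where the speed-up comes from.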
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the source segment's translation into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.
• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.
• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
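The lang dif computation described in the last bullet amounts to counting the sides of the bi-segment whose detected code disagrees with the declared one, which can be sketched as:

```python
def lang_dif(declared, detected):
    """Count of segment sides (source, target) whose detected language
    code differs from the declared one: 0, 1, or 2."""
    return sum(1 for want, got in zip(declared, detected) if want != got)

assert lang_dif(("en", "it"), ("en", "it")) == 0  # en-en, it-it
assert lang_dif(("en", "it"), ("en", "fr")) == 1  # en-en, it-fr
assert lang_dif(("en", "it"), ("de", "fr")) == 2  # en-de, it-fr
```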
All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails, or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
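The bisegment similarity feature relies on a cosine over the two texts, which can be sketched as a bag-of-words cosine. This is a simplified illustration: the actual system compares the Bing translation of the source with the target segment, and the whitespace tokenization here is our assumption.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine between bag-of-words count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# identical texts score 1.0; texts with no shared words score 0.0
assert abs(cosine_similarity("una tavola", "una tavola") - 1.0) < 1e-9
assert cosine_similarity("una tavola", "buon giorno") == 0.0
```

A true translation pair should thus score high, and an unrelated pair near zero, which is exactly what the feature is meant to capture.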
The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
4.2 Algorithms

As we showed in Section 3, there are cases where the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation holds. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011).
bull Decision Tree The decision trees are oneof the oldest classification algorithms Evenif they are known to overfit the training datathey have the advantage that the rules inferredare readable by humans This means that wecan tamper with the automatically inferredrules and at least theoretically create a betterdecision tree
bull Random Forest Random forests are ensem-ble classifiers that consist of multiple deci-sion trees The final prediction is the mode ofindividual tree predictions The Random For-est has a lower probability to overfit the datathan the Decision Trees
bull Logistic Regression The Logistic Regres-sion works particularly well when the fea-tures are linearly separable In addition theclassifier is robust to noise avoids overfittingand its output can be interpreted as probabil-ity scores
bull Support Vector Machines with the linearkernel Support Vector Machines are one ofthe most used classification algorithms
bull Gaussian Naive Bayes If the conditionalindependence that the naive Bayes class ofalgorithm postulates holds the training con-verges faster than logistic regression and thealgorithm needs less training instances
• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The predicted output is the majority label of those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
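The classifier line-up above can be sketched with scikit-learn as follows. The two features and the tiny dataset are illustrative stand-ins for the paper's normalized bi-segment features, not its real data:

```python
# Toy sketch of the six scikit-learn classifiers listed above.
# Features (bi-segment similarity, language distance) and labels are
# invented for illustration; label 1 marks a correct bi-segment.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

X = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.1],
              [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]])
y = np.array([1, 1, 1, 0, 0, 0])

for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.predict([[0.95, 0.05]])[0])
```

On this trivially separable toy data, every classifier labels the high-similarity example as a correct bi-segment.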
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
Algorithm             Precision  Recall  F1
Random Forest           0.95      0.97   0.96
Decision Tree           0.98      0.97   0.97
SVM                     0.94      0.98   0.96
K-Nearest Neighbors     0.94      0.98   0.96
Logistic Regression     0.92      0.98   0.95
Gaussian Naive Bayes    0.86      0.96   0.91
Baseline Uniform        0.69      0.53   0.60
Baseline Stratified     0.70      0.73   0.71
Table 1: The results of the three-fold stratified classification
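The evaluation setup can be sketched with scikit-learn's cross-validation utilities; the two baselines correspond to DummyClassifier's "uniform" and "stratified" strategies. The data below is synthetic, for illustration only:

```python
# Sketch of a three-fold stratified evaluation against the two baselines.
# The dataset is synthetic; labels are separable from the first feature.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((90, 2))
y = (X[:, 0] > 0.5).astype(int)

cv = StratifiedKFold(n_splits=3)
for name, clf in [
    ("SVM", SVC(kernel="linear")),
    ("Baseline Uniform", DummyClassifier(strategy="uniform", random_state=0)),
    ("Baseline Stratified", DummyClassifier(strategy="stratified", random_state=0)),
]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

As in Table 1, a real classifier should clearly beat both dummy baselines on separable data.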
Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.
The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.
Algorithm             Precision  Recall  F1
Random Forest           0.85      0.63   0.72
Decision Tree           0.82      0.69   0.75
SVM                     0.82      0.81   0.81
K-Nearest Neighbors     0.83      0.66   0.74
Logistic Regression     0.80      0.80   0.80
Gaussian Naive Bayes    0.76      0.61   0.68
Baseline Uniform        0.71      0.72   0.71
Baseline Stratified     0.70      0.51   0.59
Table 2: The results of the classification on the test set
Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where precision is increased even if recall drops. We make some suggestions in the next section.
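Working through the confusion counts stated above makes the problem concrete (the rounded figures may differ slightly from the aggregate scores in Table 2):

```python
# Precision, recall and the share of wrongly deleted good bi-segments,
# computed from the confusion counts reported in the text for the SVM.
tp, fp, fn, tn = 175, 42, 32, 60

total = tp + fp + fn + tn            # 309 test examples
precision = tp / (tp + fp)           # ~0.81
recall = tp / (tp + fn)              # ~0.85
wrongly_deleted = fn / total         # ~0.10: good bi-segments thrown away

print(total, round(precision, 2), round(recall, 2), round(wrongly_deleted, 2))
```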
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
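The probability-ranking idea can be sketched as follows; the single feature, the toy data and the 0.9 cut-off are illustrative assumptions, not the paper's actual setup:

```python
# Sketch: keep only deletions the classifier is very confident about,
# using LogisticRegression's probability output on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.9], [0.8], [0.7], [0.3], [0.2], [0.1]])
y = np.array([1, 1, 1, 0, 0, 0])      # 1 = correct bi-segment

clf = LogisticRegression().fit(X, y)
proba_bad = clf.predict_proba(X)[:, 0]      # P(false bi-segment)

# Rank candidates by confidence and delete only the most confident ones.
ranked = np.argsort(proba_bad)[::-1]
to_delete = [i for i in ranked if proba_bad[i] > 0.9]
print(ranked, to_delete)
```

Raising the cut-off trades recall for the higher precision the authors ask for.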
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments. For example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In Section 2 we discuss related work on TM matching. In Section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in Section 4 and discuss the results in Section 5. Finally, we present our conclusion and future work in Section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.
For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
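The w1...w20 example can be illustrated with a small sketch. Here difflib's ratio stands in for a TM tool's Levenshtein-style fuzzy score; the exact numbers are illustrative:

```python
# Fuzzy score of the whole input vs. the whole TM sentence, compared
# with the score once the shared clause is stored as its own segment.
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Similarity as a percentage; 100 means identical strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

input_sentence = " ".join(f"w{i}" for i in range(1, 21))     # (a): w1 ... w20
tm_sentence = " ".join([f"w{i}" for i in range(1, 6)]
                       + [f"w{i}" for i in range(21, 36)])   # (b): w1-w5 w21-w35

# Without clause splitting: the shared clause is diluted by the rest.
whole = fuzzy(input_sentence, tm_sentence)

# With clause splitting: the shared clause is its own TM segment and
# the input clause matches it exactly.
clause = " ".join(f"w{i}" for i in range(1, 6))
split = fuzzy(clause, clause)

print(whole, split)
```

The whole-sentence score falls well below a typical 70-75% threshold, while the clause-level score is a perfect match.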
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM, the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
5 Results
WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         23.33                 90.00
Retrieved (minimum threshold)         38.00                 92.22

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         14.00                 88.89
Retrieved (minimum threshold)         14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
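The test just described can be sketched in a few lines. The scores below are synthetic (the paper used SPSS on its real per-segment scores; here scipy stands in), and multi-clause cases are averaged into one case, as in the paper:

```python
# Paired t-test over per-segment match scores, with and without clause
# splitting. Scores are invented for illustration only.
import numpy as np
from scipy.stats import ttest_rel

without_split = np.array([0.0, 23.0, 0.0, 38.0, 0.0, 14.0, 0.0, 20.0])

# Where an input splits into several clauses, average the clause scores
# so each original segment contributes exactly one paired case.
with_split_clauses = [[90.0], [92.0, 88.0], [85.0], [95.0],
                      [89.0, 91.0], [93.0], [87.0], [90.0]]
with_split = np.array([np.mean(c) for c in with_split_clauses])

t, p = ttest_rel(with_split, without_split)
print(f"t = {t:.2f}, p = {p:.6f}")
```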
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                      85.31               0.000
Minimum threshold            29.28                      86.49               0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            11.78                      83.88               0.000
Minimum threshold            11.78                      88.07               0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.23                      17.21               0.000
Minimum threshold             6.49                      30.62               0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.71                      23.08               0.000
Minimum threshold            16.83                      46.37               0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          2.67                 17.84
Retrieved (minimum threshold)         10.00                 41.08

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          3.33                 25.95
Retrieved (minimum threshold)         36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Such programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar in meaning. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human centralized evaluations.
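The core idea of matching via paraphrase expansion can be sketched as follows. The toy SVC table and the use of difflib's ratio as the fuzzy score are illustrative assumptions; the paper generates paraphrases with NooJ grammars:

```python
# Sketch: expand each TM source segment with SVC paraphrases and fuzzy-
# match the input against all variants, keeping the best score.
from difflib import SequenceMatcher

paraphrases = {"take a decision": "decide",
               "make a proposal": "propose"}   # toy SVC table

def variants(segment):
    """The segment plus its paraphrased forms."""
    out = {segment}
    for svc, verb in paraphrases.items():
        if svc in segment:
            out.add(segment.replace(svc, verb))
    return out

def best_match(input_seg, tm_sources):
    score = lambda a, b: SequenceMatcher(None, a, b).ratio()
    return max(((src, max(score(input_seg, v) for v in variants(src)))
                for src in tm_sources),
               key=lambda pair: pair[1])

src, s = best_match("the committee will decide tomorrow",
                    ["the committee will take a decision tomorrow"])
print(src, round(100 * s))
```

A segment that edit distance alone would score poorly becomes an exact match once its support-verb construction is paraphrased.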
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods and, therefore, the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of the fuzzy
match estimation during the translation process
and commercial products typically do not disclose
the exact algorithm they use (Koehn 2010) How-
ever most of them are based on the word andor
character edit distance (Levenshtein distance)
(Levenshtein 1966) ie the total number of dele-
tions insertions and substitutions in order the two
sentences become identical (Hirschberg 1997)
For instance the word-based string edit dis-
tance between sentence (1) and (2) is 70 (1 sub-
stitution and 3 deletions for 13 words) and the
character-based string edit distance is 76 (14 de-
letions for 60 characters) without counting
whitespaces based on Koehnrsquos (2010) formula for
fuzzy matching This is a low score and many
translators may decide not to use it and therefore
not to gain from it
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
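Although commercial implementations differ and are not disclosed, the word-based score above can be sketched with the standard Levenshtein distance; normalizing by the length of the longer segment is one common formulation (an assumption for illustration, not any vendor's exact algorithm):

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over token lists
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(a)][len(b)]

def fuzzy_match(src, tm_src):
    # one common normalization: 1 - distance / length of the longer segment
    a, b = src.split(), tm_src.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(edit_distance(s1.split(), s2.split()))  # 4 (1 substitution + 3 deletions)
```

Note that the exact percentage depends on the chosen normalization, which is why published scores for the same sentence pair can differ slightly.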
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Apart from some exceptions, they appear with a direct object either without a determiner (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards paraphrasing the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum
Figure 1 Dictionary entries in NooJ for nouns
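For illustration, the comma-separated entries above (lemma, then a +-separated list of category, paradigm assignments and property codes) can be read with a small parser. The field interpretation follows the description in the text; this is a sketch, not NooJ's actual dictionary loader:

```python
def parse_entry(line):
    # split e.g. "man,N+FLX=MAN+Hum" into a lemma and a feature dictionary
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    entry = {"lemma": lemma, "category": fields[0]}
    for field in fields[1:]:
        if "=" in field:                  # key=value feature, e.g. FLX=MAN
            key, value = field.split("=", 1)
            entry[key] = value
        else:                             # bare semantic property, e.g. Hum
            entry.setdefault("properties", []).append(field)
    return entry

print(parse_entry("man,N+FLX=MAN+Hum"))
# {'lemma': 'man', 'category': 'N', 'FLX': 'MAN', 'properties': ['Hum']}
```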
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is organized in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected, so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and, therefore, there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
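To make the mechanism concrete, the following sketch mimics the grammar's behaviour for simple-present, base-form cases only; the support-verb list and the nominalization-to-verb table are hypothetical stand-ins for NooJ's dictionaries and derivational paradigms, and, unlike the real grammar, the sketch does not preserve tense and person:

```python
import re

# Hypothetical stand-in for NooJ's derivational dictionary:
# predicate noun -> derived full verb.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = r"make|makes|take|takes|give|gives|have|has"

# support verb + optional determiner + predicate noun + optional "of"
PATTERN = re.compile(
    r"\b(?:%s) (?:(?:a|an|the) )?(\w+)(?: of)?\b" % SUPPORT_VERBS)

def paraphrase_svc(sentence):
    # replace a recognized SVC by the corresponding full verb;
    # if the noun is not a known nominalization, leave the match unchanged
    def repl(m):
        noun = m.group(1).lower()
        return NOMINALIZATIONS.get(noun, m.group(0))
    return PATTERN.sub(repl, sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

The NooJ local grammar performs the same mapping declaratively over dictionary features, covering inflected forms and constraining the support verb to the specific SVC it belongs to.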
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before, in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). The test set was selected manually, in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations taken from a larger TM, based on a degree of fuzzy match that at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.1
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.
While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses, which will be addressed in future work.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.
The implemented strategy has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy-match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verb constructions and support for other source languages as well. Last but not least, a paraphrase framework applied to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.
Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.
Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.
Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.
He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.
He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern matching algorithms. Oxford University Press, Oxford.
Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.
Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.
Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.
Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.
Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).
Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net
Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.
Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.
Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.
Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ: být větší   EN: be greater   probabilities: 0.538 0.053 0.538 0.136   alignment: 0-0 1-1
CZ: být větší   EN: be larger    probabilities: 0.170 0.054 0.019 0.148   alignment: 0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).
In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN_O: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN_N: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE_O method, Czech → English

original:          "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:        "následujících kategorií" (the following categories)
new subsegment:    "lze rozdělit do následujících kategorií"
its translation:   can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE_O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTE_N: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n from 1 to 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
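The longest-first joining idea can be sketched over (start, end) token spans; this simplified version greedily merges a span with every compatible span, rather than reproducing the exact list bookkeeping (I, T) described above, and omits the language-model fluency scoring:

```python
def can_join(a, b):
    # spans (start, end) can join if they overlap (JOIN_O)
    # or are immediate neighbours (JOIN_N)
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1   # overlap: s2 <= e1, neighbour: s2 == e1 + 1

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def greedy_join(spans):
    # process subsegments longest-first, growing each candidate
    # as far as possible before moving to the next one
    pool = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    results = []
    while pool:
        current = pool.pop(0)
        merged = True
        while merged:
            merged = False
            for other in list(pool):
                if can_join(current, other):
                    current = join(current, other)
                    pool.remove(other)
                    merged = True
        results.append(current)
    return results

# three subsegments covering a 6-token segment collapse into one full span
print(greedy_join([(0, 2), (2, 4), (5, 5)]))  # [(0, 5)]
```

In the real method, each candidate join would additionally be scored for fluency on the target side before being accepted.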
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95-99%    0.98   0.59   1.24   1.45
85-94%    0.09   0.22   1.34   1.37
75-84%    1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature    SUB   OJ    NJ    NS
prec       0.60  0.63  0.70  0.66
recall     0.67  0.74  0.74  0.71
F1         0.64  0.68  0.72  0.68
METEOR     0.31  0.37  0.38  0.38

DGT
prec       0.76  0.93  0.91  0.81
recall     0.78  0.86  0.88  0.85
F1         0.77  0.89  0.89  0.83
METEOR     0.40  0.50  0.51  0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
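The "mixed" translation used for the METEOR evaluation can be sketched as follows. The subsegment split and the helper name are hypothetical illustrations; the idea is simply that untranslated subsegments pass through in the source language, as in MemoQ's Fragment assembly.

```python
def mixed_translation(subsegments, translations):
    """Compose the output segment: known subsegments are replaced by
    their target-language translation, unknown ones pass through
    'as is' in the source language."""
    return ' '.join(translations.get(s, s) for s in subsegments)

# Hypothetical subsegment split and partial translation dictionary;
# TRG-1 and TRG-2 stand in for target-language phrases.
subsegments = ['we would like', 'a table by', 'the window']
translations = {'we would like': 'TRG-1', 'the window': 'TRG-2'}
print(mixed_translation(subsegments, translations))  # TRG-1 a table by TRG-2
```

METEOR then scores this mixed segment against the reference, so segments left partially in the source language are penalised accordingly.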
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filters for removing phrases which would produce non-fluent output texts in the target language.
We are cooperating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and to improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool, and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we   want  to     have  a  table  near  the  window  .
|    D     S      S     |  |      S     |    |       |
we   -     would  like  a  table  by    the  window  .
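For illustration, the C/S/I/D labels above can be reproduced with a plain edit-distance backtrace. The sketch below uses equal costs and omits tercom's shift operation, so it is a simplification rather than the tercom implementation.

```python
def edit_ops(hyp, ref):
    """Dynamic-programming edit distance with backtrace.
    Returns a list of (hyp_token, ref_token, op) where op is
    C (match), S (substitution), I (insertion), or D (deletion)."""
    m, n = len(hyp), len(ref)
    # cost[i][j] = minimal edits to turn hyp[:i] into ref[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, preferring the diagonal.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])):
            op = 'C' if hyp[i - 1] == ref[j - 1] else 'S'
            ops.append((hyp[i - 1], ref[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append((hyp[i - 1], '', 'D'))   # token deleted from hyp
            i -= 1
        else:
            ops.append(('', ref[j - 1], 'I'))   # token inserted from ref
            j -= 1
    return list(reversed(ops))

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
for h, r, op in edit_ops(tm, inp):
    print(h or '-', r or '-', op)
```

On this pair the minimal alignment contains 5 matches, 3 substitutions, and 1 deletion, matching the tercom output shown above.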
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
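The preference described above can be sketched as a weighted edit cost. The particular weight values below are placeholders for illustration, not the weights used in CATaLog.

```python
# Illustrative edit weights: deletion is cheap, the rest are not.
# These particular values are assumptions, not CATaLog's settings.
WEIGHTS = {'C': 0.0, 'D': 0.1, 'I': 1.0, 'S': 1.0}

def weighted_cost(ops):
    """Sum the weighted edit operations; a lower cost means a better
    TM match for post-editing."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
# TM segment 1 needs 6 deletions ("to the holiday inn by taxi");
# TM segment 2 needs 3 insertions ("does it cost").
seg1_ops = ['C'] * 5 + ['D'] * 6
seg2_ops = ['C'] * 2 + ['I'] * 3

print(round(weighted_cost(seg1_ops), 2))  # 0.6 -> segment 1 preferred
print(round(weighted_cost(seg2_ops), 2))  # 3.0
```

With equal weights both segments would cost 6 and 3 respectively, reversing the ranking; the cheap deletion flips the preference to TM segment 1, as argued above.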
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
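The projection of source-side matches onto the target through the word alignment can be sketched as follows. This is a simplification with made-up indices; parsing of the actual GIZA++ output and the rendering are omitted.

```python
def color_target(matched_src_positions, src2tgt, tgt_len):
    """Mark a target position green when it is aligned to a matched
    source position; everything else is red (to be post-edited).
    src2tgt maps a 0-based source index to a list of target indices."""
    green = set()
    for s in matched_src_positions:
        green.update(src2tgt.get(s, []))
    return ['green' if t in green else 'red' for t in range(tgt_len)]

# Simplified 0-based version of the GIZA++ example above, with a
# 6-token target side: we->0, want->5, a->3, table->4, near->2,
# window->1 (to/have/the unaligned).
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}
# Suppose TER marked source positions 0, 4, 5, 7, 8 as matches.
matched = {0, 4, 5, 7, 8}
print(color_target(matched, src2tgt, 6))
```

Target positions reached from a matched source word stay green; the remainder is shown red, which is exactly the per-fragment view the translator sees.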
Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words from the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me
5. you 're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the 5 topmost color-coded TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will differ significantly.
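The posting-list filtering can be sketched as a small inverted index. The stop-word list below is a hypothetical placeholder; as described above, words are lowercased but not stemmed, so only exact surface forms match.

```python
from collections import defaultdict

STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'is', 'it'}  # illustrative only

def tokenize(sentence):
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def build_index(tm_sources):
    """Posting lists: surface word -> ids of TM sentences containing it.
    No stemming, so only exact surface forms match (as in the paper)."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in tokenize(sent):
            index[w].add(sid)
    return index

def candidates(index, query):
    """Only TM sentences sharing at least one vocabulary word with the
    query are passed on to the (expensive) TER scoring step."""
    ids = set()
    for w in tokenize(query):
        ids |= index.get(w, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
idx = build_index(tm)
print(sorted(candidates(idx, "a table by the window")))  # [0]
```

Only sentence 0 shares a vocabulary word ("table", "window") with the query, so TER needs to be computed for a single candidate instead of the whole TM.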
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with the color coding scheme.
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
give good results in classification problems where the decision boundary is irregular.
5 Results and discussion
We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set, in which the algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions uniformly at random. The second baseline, called Baseline Stratified, generates predictions that respect the class distribution of the training set. The results of the first evaluation are given in Table 1.
Algorithm              Precision   Recall   F1
Random Forest            0.95       0.97   0.96
Decision Tree            0.98       0.97   0.97
SVM                      0.94       0.98   0.96
K-Nearest Neighbors      0.94       0.98   0.96
Logistic Regression      0.92       0.98   0.95
Gaussian Naive Bayes     0.86       0.96   0.91
Baseline Uniform         0.69       0.53   0.60
Baseline Stratified      0.70       0.73   0.71

Table 1: The results of the three-fold stratified classification
Except for Gaussian Naive Bayes, all algorithms achieve excellent results, beating the baselines by a significant margin (at least 20 points).
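The evaluation protocol above can be sketched with scikit-learn, which the paper's experiments cite; the data here is a synthetic stand-in for the real bi-segment features and labels, and the model selection is illustrative.

```python
# Minimal sketch of three-fold stratified cross-validation against the two
# dummy baselines described above. The feature matrix is synthetic
# (make_classification); in the paper it comes from annotated bi-segments.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with an uneven class distribution.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.3, 0.7], random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # "uniform" predicts classes at random; "stratified" respects the
    # training-set class distribution -- the paper's two baselines.
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    print(f"{name}: mean F1 = {scores[name]:.2f}")
```

On well-separated data any trained classifier should clear both dummy baselines by a wide margin, mirroring the gap reported in Table 1.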
The second evaluation is performed on the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference observed between the best performing classification algorithm and the same baseline in the first evaluation. This can be partially explained by the great variety of the bi-segments in the MateCat and Web Sets; evidently this variety is not fully captured by the training set.
Algorithm              Precision   Recall   F1
Random Forest            0.85       0.63   0.72
Decision Tree            0.82       0.69   0.75
SVM                      0.82       0.81   0.81
K-Nearest Neighbors      0.83       0.66   0.74
Logistic Regression      0.80       0.80   0.80
Gaussian Naive Bayes     0.76       0.61   0.68
Baseline Uniform         0.71       0.72   0.71
Baseline Stratified      0.70       0.51   0.59

Table 2: The results of the classification on the test set
Unlike the first evaluation, the second has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression, both producing F1-scores around 0.8. These results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away; applying the method to the MyMemory database would thus eliminate many good bi-segments. We should therefore search for better cleaning methods in which precision is increased even if recall drops. We make some suggestions in the next section.
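The arithmetic behind the "around 10%" figure can be checked directly from the confusion counts quoted above:

```python
# Worked check of the SVM confusion-table figures on the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn                  # 309 examples

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
discarded_good = fn / total                # good bi-segments wrongly removed

print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
print(f"share of all examples lost as false negatives = {discarded_good:.1%}")
```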
6 Conclusions and further work
In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to normal for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the
performance in finding the true negative examples does not offset the cost of deleting the false negatives from the database.
There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In future work we will study ensemble classifiers, which can potentially boost the performance of the classification task: with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on the subset of bi-segments for which classification precision exceeds 90%. For example, the output of the Logistic Regression classifier can be interpreted as a probability; our hope is that these probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
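The second solution can be sketched as follows: keep only bi-segments whose predicted probability clears a strict threshold, trading recall for precision. The data and the 0.9 cut-off below are illustrative, not the paper's.

```python
# Sketch of precision-oriented filtering via Logistic Regression
# probabilities. Synthetic data stands in for the bi-segment features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.1,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

prec_default = precision_score(y_te, proba >= 0.5)
prec_strict = precision_score(y_te, proba >= 0.9)  # confident subset only
kept = int((proba >= 0.9).sum())

print(f"precision @0.5 = {prec_default:.2f}")
print(f"precision @0.9 = {prec_strict:.2f} on {kept} confident predictions")
```

Raising the threshold shrinks the set of bi-segments acted upon, which is exactly the precision/recall trade-off proposed above.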
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
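A minimal sketch of that dictionary-based alternative: an indexed bilingual dictionary (a plain dict here, with invented EN-IT entries standing in for aligner output) used to gloss a source segment token by token.

```python
# Hypothetical EN->IT entries, as if extracted by a word aligner.
bilingual_dict = {
    "translation": "traduzione",
    "memory": "memoria",
    "system": "sistema",
}

def gloss(segment):
    """Look up each lowercased token; unknown tokens pass through unchanged."""
    return [bilingual_dict.get(tok, tok) for tok in segment.lower().split()]

print(gloss("Translation memory system"))
```

Because each lookup is a constant-time index hit rather than a full decoding pass, such a component scales to a database the size of MyMemory far more cheaply than an MT system.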
Acknowledgments
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015
Abstract
We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches: if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of finding a match for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context, and to actually be used by the translator, than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work
Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, allowing sub-sentence matching. However, Reinke (2013) observes that for certain language pairs, like English-German, only rather short phrases such as simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching
Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
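The point can be illustrated numerically, assuming a character-level Levenshtein similarity as a stand-in for a TM fuzzy-match score (commercial tools use their own variants of this formula):

```python
# Levenshtein-based similarity of the whole sentences (a) and (b) versus
# their shared clause, mirroring the w1..w20 / w1..w5, w21..w35 example.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

sent_a = " ".join(f"w{i}" for i in range(1, 21))                    # (a)
sent_b = " ".join(["w1 w2 w3 w4 w5"] + [f"w{i}" for i in range(21, 36)])  # (b)
clause = "w1 w2 w3 w4 w5"                                           # shared clause

whole = similarity(sent_a, sent_b)   # low: unlikely to pass a fuzzy threshold
split = similarity(clause, clause)   # the shared clause alone matches fully
print(f"whole-sentence similarity = {whole:.0%}, clause similarity = {split:.0%}")
```

With clause splitting, the comparison is clause-against-clause, so the shared material scores 100% instead of being diluted by the unrelated remainder of each sentence.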
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform when clause splitting is used in pre-processing.
4 Experiments
Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. In this study, however, we argue that it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of them contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.
Without clause splitting

Input:         the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting

Input:         a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input:         a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)            23.33%                 90.00%
Retrieved (minimum threshold)            38.00%                 92.22%

TRADOS                            W/o clause splitting   W/ clause splitting
Retrieved (default threshold)            14.00%                 88.89%
Retrieved (minimum threshold)            14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, we observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all when it has more than one clause, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
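The test just described can be sketched as follows; the per-segment scores here are synthetic stand-ins (the paper used SPSS; SciPy's `ttest_rel` implements the same paired t-test).

```python
# Paired t-test over per-segment match scores with and without clause
# splitting, as in Table 5. Scores are illustrative, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_segments = 150
# Hypothetical match scores (0 = no match or incorrect match retrieved).
without_cs = rng.uniform(0, 40, size=n_segments)
with_cs = np.clip(without_cs + rng.uniform(20, 70, size=n_segments), 0, 100)

t_stat, p_value = stats.ttest_rel(with_cs, without_cs)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A paired test is appropriate here because the two conditions score the very same 150 input segments, so each pair of observations is dependent.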
SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           19.23                     85.30888891           0.000
Minimum threshold           29.28                     86.48888891           0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           11.78                     83.87777781           0.000
Minimum threshold           11.78                     88.07111113           0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.23                     17.21111111           0.000
Minimum threshold            6.49                     30.62111112           0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.71                     23.083                0.000
Minimum threshold           16.83                     46.36777779           0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments in which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)             2.67%                 17.84%
Retrieved (minimum threshold)            10.00%                 41.08%

TRADOS                            W/o clause splitting   W/ clause splitting
Retrieved (default threshold)             3.33%                 25.95%
Retrieved (minimum threshold)            36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is whether they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitl@auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and do not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
cient
The fuzzy match level refers to all the neces-
sary corrections made by a professional transla-
tion in order to make the retrieved suggestion to
meet all the standards of the translation process
This effort is typically less than translating the
sentence from scratch To help the translator
CAT tools suggest or highlight all the differences
or similarities between the sentences penaltizing
as well the match percent in some cases However
given the perplexity of a natural language for
similar but not identical sentences the fuzzy
matching level sometimes is too low and therefore
the translator is confused
This paper presents a framework that improves the fuzzy matching of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the fuzzy match level as much as possible, given that the paraphrases share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.
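The core idea (Y2, the translation of Y1, can serve X1 when X1 paraphrases Y1) can be sketched as a toy lookup. The SVC rule and the TM entries below are invented for illustration; in the paper the paraphrases are produced by a NooJ grammar rather than a hand-written table.

```python
# Toy paraphrase-expanded TM lookup: a query worded differently from a
# stored source unit still retrieves that unit's translation.
PARAPHRASES = {
    "take a decision": "decide",   # hypothetical SVC -> full-verb pair
    "make a proposal": "propose",
}

def variants(segment):
    """The segment plus every single-rule SVC rewrite, in both directions."""
    out = {segment}
    for svc, verb in PARAPHRASES.items():
        if svc in segment:
            out.add(segment.replace(svc, verb))
        if verb in segment:
            out.add(segment.replace(verb, svc))
    return out

tm = {"the committee will take a decision tomorrow":
      "il comitato prenderà una decisione domani"}
query = "the committee will decide tomorrow"

# An exact hit against any paraphrase variant of a stored source unit.
matches = [tgt for src, tgt in tm.items() if query in variants(src)]
print(matches)
```

A real system would feed each variant through the fuzzy-match scorer rather than require exact equality, so paraphrases raise the match percentage instead of gating retrieval outright.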
The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 presents the theoretical background; Section 4 presents the conceptual background as well as the architecture of the framework; Section 5 details the experimental results; and Section 6 outlines plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart 2010, He et al 2010a, Zhechev and van Genabith 2010, Wang et al 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output; they do not improve the fuzzy matching itself.
A common methodology that gives priority to human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark 2009, Koehn and Senellart 2010, He et al 2010a, Zhechev and van Genabith 2010, Wang et al 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al (2014): they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on the paraphrases in a database. Their approach uses greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and a segment can contain several phrases that can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques: the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg 1997).
For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
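The word-based scoring just illustrated can be sketched in a few lines. This is a minimal illustration, not the exact formula of any commercial tool: we assume the common formulation of fuzzy matching (Koehn 2010) as one minus the word edit distance divided by the length of the longer segment; actual products tokenize and normalize differently, which is why the toy score below (60%) differs from the 70% quoted above.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance: the minimum number of deletions,
    insertions and substitutions that turn token list a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(sentence, tm_source):
    """Assumed fuzzy match formulation: 1 - edit_distance / max length."""
    s, t = sentence.split(), tm_source.split()
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(fuzzy_match(s1, s2))  # 0.6 (1 substitution + 3 deletions over 10 words)
```

A TM engine would compute such a score for the input sentence against every TM source segment and offer the best matches above a threshold.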
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro 2008).
SVCs include common verbs like give, have, make, take, etc. Such complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs in their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a determiner (e.g. to make attention) and in others with one (e.g. to make a reservation) (Athayde 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In
NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
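Such entries are easy to consume programmatically. The sketch below reads the comma-separated lemma/category layout shown in Figure 1; treating valueless flags like +Hum as booleans is our assumption, not documented NooJ behaviour.

```python
def parse_entry(line):
    """Parse a NooJ-style dictionary entry such as 'artist,N+FLX=TABLE+Hum'
    into (lemma, category, features)."""
    lemma, rest = line.split(",", 1)
    category, *attrs = rest.split("+")
    features = {}
    for attr in attrs:
        key, _, value = attr.partition("=")
        features[key] = value or True  # flags like 'Hum' carry no value
    return lemma, category, features

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# ('artist', 'N', {'FLX': 'TABLE', 'Hum': True})
```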
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red mark the variables for the verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the construction are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
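As a rough, regex-based stand-in for this local grammar, the sketch below matches a support verb, an optional determiner, and a nominalization with an optional trailing preposition, and replaces the whole construction with the corresponding full verb. Everything here is illustrative: the nominalization table is a toy, and unlike the NooJ grammars this sketch does not preserve tense and person (it always emits the base form).

```python
import re

# Toy nominalization lexicon: predicate noun -> full verb
# (the real system derives these pairs from NooJ/OpenLogos dictionaries)
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve", "decision": "decide"}
SUPPORT_VERBS = ["make", "makes", "take", "takes", "have", "has"]

SVC_RE = re.compile(
    r"\b(?P<sv>" + "|".join(SUPPORT_VERBS) + r")\s+"
    r"(?:(?:a|an|the)\s+)?"                       # optional determiner
    r"(?P<noun>" + "|".join(NOMINALIZATIONS) + r")"
    r"(?:\s+of)?\b",                              # optional preposition
    re.IGNORECASE)

def paraphrase_svc(sentence):
    """Replace a support-verb construction with its full-verb paraphrase."""
    return SVC_RE.sub(lambda m: NOMINALIZATIONS[m.group("noun").lower()],
                      sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

For instance, it rewrites "make the cancellation of" as "cancel", which is exactly the rewrite that turns sentence (1) into sentence (2).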
The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores when the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on a fuzzy match degree of at least the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM substantially increases the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), although we also achieve high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing SVCs significantly improves the retrieval results, and hence productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category | plTM | parTM
100%                 |   14 |    48
95-99%               |   23 |    32
85-94%               |   18 |    29
75-84%               |   51 |    38
50-74%               |   32 |    18
No Match             |   62 |    35
Total                |  200 |   200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is below 100%.
Despite the gains in fuzzy match scores, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.
The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that, for all fuzzy-match ranges, our approach performs markedly better than the plain TM, especially in the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs, and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentences may improve the quality of translations even further.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.
Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.
Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.
Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.
Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.
He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.
He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.
Hodász, G. & Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.
Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.
Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet physics doklady, volume 10, pages 707-710.
Pekar, V. & Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.
Planas, E. & Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th machine translation summit (pp. 331-339).
Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.
Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.
Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.
Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al 2013) or MyMemory (Trombetti 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.
Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities             alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases
within CAT systems (Planas and Furuse 2000).
In (Désilets et al 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.
(Simard and Langlais 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text-chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel 2008) (a parallel version of GIZA++ (Och and Ney 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
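A crude stand-in for this language-model fluency score can be built from raw n-gram counts. The toy model below only illustrates the idea of a combined n = 1..5 score; the paper uses a real KenLM model trained on 50 million sentences, and the add-one smoothing here is our assumption.

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyFluencyModel:
    """Toy combined n-gram scorer (n = 1..5) with add-one smoothing;
    a stand-in for the KenLM model used in the paper."""
    def __init__(self, corpus_sentences, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in corpus_sentences:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def score(self, sentence):
        toks = sentence.split()
        total = 0.0
        for n in range(1, self.max_n + 1):
            seen = sum(self.counts[n][g] for g in ngrams(toks, n))
            denom = sum(self.counts[n].values()) + len(self.counts[n]) + 1
            total += log((seen + 1) / denom)
        return total

lm = ToyFluencyModel(["the brake hose is not twisted",
                      "the data area may have an arbitrary shape"])
print(lm.score("the brake hose") > lm.score("hose brake the"))  # True
```

Candidate combined segments would be ranked by such a score, so that a fluent concatenation outranks an ungrammatical one.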
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
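The JOIN loop can be illustrated over (start, end) token spans of the known subsegments within a document segment. This is a simplified greedy sketch: two spans combine when they overlap (JOINO) or directly neighbour each other (JOINN), longest spans first; the real algorithm additionally maintains the temporary list T and restarts from newly built subsegments, and works on actual TM phrase pairs rather than bare spans.

```python
def combinable(a, b):
    """True when two (start, end) spans overlap (JOINO) or neighbour (JOINN)."""
    return a[0] <= b[1] + 1 and b[0] <= a[1] + 1

def join_subsegments(spans):
    """Greedy JOIN sketch: merge overlapping/neighbouring token spans,
    trying the longest subsegments first, as in the ordering above."""
    pending = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged = []
    while pending:
        cur = pending.pop(0)
        changed = True
        while changed:
            changed = False
            for other in pending:
                if combinable(cur, other):
                    cur = (min(cur[0], other[0]), max(cur[1], other[1]))
                    pending.remove(other)
                    changed = True
                    break
        merged.append(cur)
    return merged

# spans over tokens 0-2 and 2-4 overlap and join; 6-7 stays apart
print(join_subsegments([(0, 2), (2, 4), (6, 7)]))  # [(0, 4), (6, 7)]
```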
² In the current experiments we have trained a language model using the KenLM (Heafield 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods (see Section 4). The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100      0.41    0.12    0.10    0.17    0.46
95–99    0.84    0.91    0.64    0.90    1.37
85–94    0.07    0.05    0.25    0.76    0.81
75–84    0.80    0.91    1.71    3.78    4.40
50–74    8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100      0.08    0.07    0.11    0.28
95–99    0.75    0.44    0.49    0.66
85–94    0.05    0.08    0.49    0.61
75–84    0.46    0.96    3.67    3.85
50–74    10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100      0.15    0.13    0.29    0.57
95–99    0.98    0.59    1.24    1.45
85–94    0.09    0.22    1.34    1.37
75–84    1.03    2.26    6.35    7.07
50–74    12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system3. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English

source seg:      Oblast dat může mít libovolný tvar
reference:       The data area may have an arbitrary shape
generated seg:   Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: using the generated TMs in the process of translating a set of documents and measuring the time needed for the translation.
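As a rough illustration of this planned fluency filter, the sketch below scores candidate phrase translations with a toy bigram model and drops those under a threshold. The bigram table, back-off penalty and threshold are illustrative assumptions, not the authors' setup (the paper's model is a KenLM model trained on enTenTen):

```python
import math

# Hypothetical bigram log-probabilities standing in for a real LM.
BIGRAMS = {("<s>", "the"): -0.5, ("the", "data"): -1.0,
           ("data", "area"): -1.2, ("area", "</s>"): -0.8}
UNSEEN = math.log(1e-4)  # crude back-off penalty for unseen bigrams

def lm_score(phrase):
    """Sum bigram log-probabilities over the phrase (toy LM)."""
    toks = ["<s>"] + phrase.split() + ["</s>"]
    return sum(BIGRAMS.get(pair, UNSEEN) for pair in zip(toks, toks[1:]))

def filter_fluent(candidates, threshold):
    """Drop candidate phrase translations whose length-normalized LM
    score falls below the threshold, so they would not take part in
    the final segment combination."""
    return [p for p in candidates
            if lm_score(p) / (len(p.split()) + 1) >= threshold]

print(filter_fluent(["the data area", "area data the"], threshold=-4.0))
# → ['the data area']
```

The scrambled variant scores only unseen bigrams and falls below the threshold, which is the kind of non-fluent combination the filter is meant to reject.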
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filters for removing phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
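The retrieve-and-rank step can be sketched as follows. CATaLog uses TER (via tercom) for the similarity; as a lightweight stand-in, this sketch ranks TM sources by `difflib`'s token-level similarity ratio instead, so the example TM data and the similarity function are illustrative assumptions, not the tool's actual code:

```python
import difflib

def rank_tm_matches(query, tm_sources, k=5):
    """Return the k TM source segments most similar to the input
    segment. CATaLog scores with TER; difflib's SequenceMatcher
    ratio over tokens stands in here as a simple similarity proxy."""
    def sim(sent):
        return difflib.SequenceMatcher(None, query.split(), sent.split()).ratio()
    return sorted(tm_sources, key=sim, reverse=True)[:k]

# Hypothetical mini-TM in the spirit of the BTEC examples below.
tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me",
      "i would like a window seat"]
print(rank_tm_matches("we would like a table by the window", tm, k=2))
```

The top-ranked segments would then be passed to the alignment and color-coding steps described next.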
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] ["","",C,0]
we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost
• TM segment 1: how much does it cost to the holiday inn by taxi
• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
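A minimal sketch of such weighted edit costs over tokens is given below. The concrete weight values are illustrative assumptions (the paper does not state them), and the shift operation of full TER is omitted for brevity:

```python
def weighted_edit_cost(test, tm, ins=1.0, dele=0.1, sub=1.0):
    """Token-level Levenshtein-style distance with configurable edit
    weights. Deletions (TM words absent from the test segment) are
    made much cheaper than insertions/substitutions, mirroring the
    lower post-editing effort of deleting already-highlighted words."""
    s, t = test.split(), tm.split()
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * ins
    for j in range(1, n + 1):
        d[0][j] = j * dele
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + ins,      # insert a test word
                          d[i][j - 1] + dele,     # delete a TM word
                          d[i - 1][j - 1] + match)
    return d[m][n]

test = "how much does it cost"
seg1 = "how much does it cost to the holiday inn by taxi"
seg2 = "how much"
# With cheap deletions, segment 1 (6 deletions) beats segment 2
# (3 insertions), as argued above.
print(weighted_edit_cost(test, seg1), weighted_edit_cost(test, seg2))
```

With equal weights the ordering flips, reproducing the plain-TER behaviour the text warns against.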
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
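Reading the simplified alignment notation shown above into usable source-to-target links can be sketched as follows (the parser assumes the exact "word ( i j ... )" layout of the example; real GIZA++ A3 output uses a slightly different bracket style):

```python
import re

def parse_alignment(line):
    """Parse the simplified word-alignment notation 'word ( i j ... )'
    into (word, [target positions]) pairs; words aligned to nothing
    get an empty list."""
    pairs = []
    for word, idxs in re.findall(r"(\S+)\s*\(\s*([\d\s]*)\)", line):
        pairs.append((word, [int(i) for i in idxs.split()]))
    return pairs

line = ("NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) "
        "table ( 5 ) near ( 3 ) the ( ) window ( 2 )")
for word, targets in parse_alignment(line):
    print(word, targets)
```

Such pairs, combined with the TER alignment, are what the color-coding step needs: a source token is colored green when it is TER-matched, and its aligned target positions inherit the same color.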
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
5 you 're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
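The matched/unmatched rendering itself can be sketched as below. CATaLog's actual interface code is not described in the paper, so the HTML-span output here is an illustrative stand-in; the per-token match flags would come from the TER and GIZA++ alignments described above:

```python
def color_code(tokens, matched):
    """Render a segment with matched tokens in green and unmatched
    tokens in red, as simple HTML spans (an illustrative stand-in
    for CATaLog's rendering, not its actual interface code)."""
    spans = []
    for tok, ok in zip(tokens, matched):
        color = "green" if ok else "red"
        spans.append('<span style="color:{}">{}</span>'.format(color, tok))
    return " ".join(spans)

# Match 1 above vs. input "you gave me wrong number": hypothetical flags.
tokens = "you gave me the wrong change".split()
matched = [True, True, True, False, True, False]
print(color_code(tokens, matched))
```

The ratio of green to red spans is exactly the visual cue the translator uses to pick the cheapest match to post-edit.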
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) will take a lot of time. For improving the search efficiency we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare the results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
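The posting-list filter described above can be sketched as follows; the stop-word list here is an illustrative assumption, not the one used in CATaLog:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "we", "of"}  # illustrative stop list

def build_index(tm_sources):
    """Build posting lists: for each lowercased, non-stop-word surface
    form, the set of TM sentence ids containing it. No stemming is
    applied, matching the surface-form requirement above."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for tok in sent.lower().split():
            if tok not in STOP_WORDS:
                index[tok].add(sid)
    return index

def candidates(index, query):
    """Union of posting lists for the query's vocabulary words; only
    these TM sentences are then scored with TER."""
    ids = set()
    for tok in query.lower().split():
        ids |= index.get(tok, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
idx = build_index(tm)
print(sorted(candidates(idx, "a table by the window")))  # → [0]
```

Only the sentences returned by `candidates` go through the expensive TER alignment, which is where the speed-up comes from.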
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
performance in finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In future work we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on the subset of bi-segments for which the classification precision is above 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
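The high-confidence filtering described above can be sketched as follows (a minimal illustration of ours, with invented probability scores standing in for the output of the actual classifier):

```python
# Hypothetical classifier scores for bi-segments: the probability that
# each pair is a true translation (values invented for illustration).
scored = [
    ("seg-1", 0.97), ("seg-2", 0.55), ("seg-3", 0.03),
    ("seg-4", 0.91), ("seg-5", 0.48), ("seg-6", 0.08),
]

THRESHOLD = 0.90  # act only where the model is highly confident

keep   = [s for s, p in scored if p >= THRESHOLD]       # confident positives
delete = [s for s, p in scored if p <= 1 - THRESHOLD]   # confident negatives
defer  = [s for s, p in scored if 1 - THRESHOLD < p < THRESHOLD]

print(keep)    # confidently kept bi-segments
print(delete)  # confidently removed bi-segments
print(defer)   # uncertain cases left for a human or a better model
```

Only the two confident buckets would be acted on automatically; the uncertain middle band is left untouched, which is exactly what restricting the system to a high-precision subset means.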
Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015.
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches: if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence will be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of finding a match for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusions and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and so allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke (2013)).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
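The effect can be illustrated with a small sketch (our own illustration, not the authors' implementation), using difflib's similarity ratio as a stand-in for a Levenshtein-based fuzzy match score and the Set A example from Table 1, with clause boundaries assumed to be already known:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Word-level similarity ratio; a stand-in for a
    Levenshtein-based fuzzy match score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

tm_segment = ("the ministry of defense once indicated "
              "that about 20000 soldiers were missing in the korean war "
              "and that the ministry of defense believes "
              "there may still be some survivors")
tm_clauses = [
    "the ministry of defense once indicated",
    "that about 20000 soldiers were missing in the korean war",
    "and that the ministry of defense believes",
    "there may still be some survivors",
]
query = "the ministry of defense once indicated"

# Against the whole segment, the score is diluted by the other clauses...
print(similarity(query, tm_segment))
# ...but against the split clauses, the best clause is an exact match.
print(max(similarity(query, c) for c in tm_clauses))
```

With the segment stored whole, the score falls well below typical fuzzy match thresholds; with the segment split, the matching clause scores 1.0 and would be retrieved.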
In this paper, we examine the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
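As a rough illustration of the kind of pre-processing involved, here is a naive sketch of ours; it is far simpler than the rule-based Puscasu (2004) splitter, and the marker list is a hand-picked assumption, not the actual rule set:

```python
import re

# Toy set of clause-introducing markers (an illustrative assumption,
# not the rules of the actual splitter, which detects finite verbs).
MARKERS = ("that", "because", "although", "which")

def naive_clause_split(sentence):
    """Split a sentence at the whitespace before each marker,
    keeping the marker with the clause it introduces."""
    pattern = r"\s+(?=(?:" + "|".join(MARKERS) + r")\b)"
    return re.split(pattern, sentence)

print(naive_clause_split(
    "the ministry of defense once indicated "
    "that about 20000 soldiers were missing in the korean war"))
```

Each clause produced this way would then be stored (or looked up) as a separate TM segment.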
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm of most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
5 Results

WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    23.33%                     90.00%
Retrieved (minimum threshold)    38.00%                     92.22%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    14.00%                     88.89%
Retrieved (minimum threshold)    14.00%                     96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment not being split at all when it has more than one clause, or being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
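The statistic behind this test can be sketched in plain Python (our own illustration of what SPSS computes; the two score lists are invented, with 0 standing for a correct match that was not retrieved):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t statistic for two equal-length lists of
    per-segment match scores (n - 1 degrees of freedom)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)        # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical match scores for ten test segments.
without_split = [0, 0, 45, 0, 30, 0, 0, 52, 0, 0]
with_split    = [100, 85, 100, 76, 90, 100, 68, 100, 81, 95]

t = paired_t(without_split, with_split)
print(t)  # a large positive t indicates a significant improvement
```

The p-value then follows from the t distribution with n - 1 degrees of freedom; a statistics package such as SPSS (or scipy.stats.ttest_rel) does that final step.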
SET A

WORDFAST (mean match scores)   Without clause splitting   With clause splitting   p-value
Default threshold              19.23                      85.30888891             0.000
Minimum threshold              29.28                      86.48888891             0.000

TRADOS (mean match scores)     Without clause splitting   With clause splitting   p-value
Default threshold              11.78                      83.87777781             0.000
Minimum threshold              11.78                      88.07111113             0.000

SET B

WORDFAST (mean match scores)   Without clause splitting   With clause splitting   p-value
Default threshold              2.23                       17.21111111             0.000
Minimum threshold              6.49                       30.62111112             0.000

TRADOS (mean match scores)     Without clause splitting   With clause splitting   p-value
Default threshold              2.71                       23.083                  0.000
Minimum threshold              16.83                      46.36777779             0.000

Table 5: Paired t-test on all experiments
WORDFAST                         Without clause splitting   With clause splitting
Retrieved (default threshold)    2.67%                      17.84%
Retrieved (minimum threshold)    10.00%                     41.08%

TRADOS                           Without clause splitting   With clause splitting
Retrieved (default threshold)    3.33%                      25.95%
Retrieved (minimum threshold)    36.00%                     70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is whether they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and do not allow the retrieval of matches that are merely similar. In this paper, we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalising the match percentage. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows. Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques have an effect only on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
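The word-based fuzzy match score used above can be sketched as follows. This is a minimal illustration of the Levenshtein-based formula (one minus the edit distance normalized by the longer segment's length), not the undisclosed algorithm of any commercial product:

```python
def edit_distance(a, b):
    """Levenshtein distance over token lists: minimum number of
    insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(source, tm_source):
    """Word-based fuzzy match score between a new segment and a TM
    source segment, normalized by the longer segment's length."""
    a, b = source.split(), tm_source.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

For example, sentences (2) and (4) differ in one word out of seven, giving a score of roughly 86%.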
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains a support verb construction (SVC), while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. Although support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a direct object (e.g. to make attention) and in others with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences that share the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum
Figure 1: Dictionary entries in NooJ for nouns
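To make the entry format concrete, the following sketch parses lines shaped like those in Figure 1 into lemma, category and feature codes. This is an illustrative reader for the comma/plus-separated surface format only, not NooJ's own API; the attribute names are taken verbatim from the entries above.

```python
def parse_entry(line):
    """Parse a NooJ-style dictionary line such as 'artist,N+FLX=TABLE+Hum'
    into (lemma, category, features). Codes with '=' become key/value
    pairs; bare codes (e.g. +Hum) become boolean flags."""
    lemma, rest = line.split(",", 1)
    category, *codes = rest.split("+")
    features = {}
    for code in codes:
        if "=" in code:
            key, value = code.split("=", 1)
            features[key] = value
        else:
            features[code] = True
    return lemma, category, features
```

For example, parse_entry("artist,N+FLX=TABLE+Hum") yields ("artist", "N", {"FLX": "TABLE", "Hum": True}).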
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is also used for de-tokenization in the last step.
Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source language translation units.
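Under the hood, the whole process amounts to: extract the source sides, paraphrase their SVCs, and append the paraphrased units to the TM, leaving every target side untouched. The sketch below imitates that flow with a toy rule-based SVC paraphraser standing in for NooJ; the tiny derivation lexicon (cancellation → cancel, installation → install) and the token-level matching are illustrative assumptions, not the actual local grammars:

```python
SUPPORT_VERBS = {"make", "makes", "take", "takes", "give", "gives", "have", "has"}
DETERMINERS = {"a", "an", "the"}
# Toy derivation lexicon mapping predicate nouns to their full verbs.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install"}

def paraphrase_svc(tokens):
    """Replace 'support verb + determiner + predicate noun (+ of)' with the
    corresponding full verb; return None if no SVC is found."""
    out, i, changed = [], 0, False
    while i < len(tokens):
        if (tokens[i].lower() in SUPPORT_VERBS and i + 2 < len(tokens)
                and tokens[i + 1].lower() in DETERMINERS
                and tokens[i + 2].lower() in NOMINALIZATIONS):
            out.append(NOMINALIZATIONS[tokens[i + 2].lower()])
            i += 3
            if i < len(tokens) and tokens[i].lower() == "of":
                i += 1  # drop the preposition introduced by the nominalization
            changed = True
        else:
            out.append(tokens[i])
            i += 1
    return out if changed else None

def expand_tm(units):
    """Append a paraphrased copy of each (source, target) unit whose source
    contains an SVC; target sides are left 'as is'."""
    expanded = list(units)
    for source, target in units:
        para = paraphrase_svc(source.split())
        if para:
            expanded.append((" ".join(para), target))
    return expanded
```

With this toy lexicon, the unit for sentence (1) gains a paraphrased twin whose source matches sentence (2) exactly, while both share the human-translated Italian target.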
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before when the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem.
In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.1
The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score was improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve as much as possible the fuzzy match level for sentences that share the same meaning but not the same lexical items.
The strategy implemented has already been evaluated in different experiments with translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy match ranges, especially the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs and support for other languages. Last but not least, applying the paraphrase framework to the target sentences may further improve the quality of the translations.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases or, in the best case, full segments.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)
3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
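The overlap/neighbour conditions for JOIN can be stated over token spans: writing each subsegment occurrence as a half-open index interval (start, end) in the document segment, JOINN requires the intervals to be adjacent and JOINO requires them to overlap. A minimal sketch (our own formulation of the conditions described in the text, not the authors' code):

```python
def join_spans(a, b):
    """Try to join two subsegment occurrences given as half-open token
    spans (start, end) within one document segment.
    Returns (kind, merged_span) or None if they cannot be joined."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if e1 == s2:            # JOINN: spans are neighbours
        return "JOINN", (s1, e2)
    if s2 < e1 <= e2:       # JOINO: spans genuinely overlap
        return "JOINO", (s1, e2)
    return None             # gap between spans, or one span contained in the other
```

For the tokens of a new document segment, spans covered by SUB entries would then be joined greedily, longest first, as described by the JOIN algorithm below.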
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.

2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
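The fluency check can be illustrated with a toy combined n-gram score. Here a plain count table stands in for the KenLM model used by the authors, and the scoring function (average log-frequency over all n-grams, n = 1..5) is our own simplification:

```python
import math

def combined_ngram_score(tokens, counts, max_n=5):
    """Average log-count over all n-grams (n = 1..max_n) of a candidate
    segment; n-grams unseen in the target-language counts are heavily
    penalized, so fluent combinations outrank disfluent ones."""
    logs = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            logs.append(math.log(counts.get(gram, 0) + 1e-9))
    return sum(logs) / len(logs)
```

Candidate segment combinations scoring below a chosen threshold would be discarded before entering the expanded TM.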
5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have run and evaluated the methods also on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software; the same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables); each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM; translations of shorter parts of the segment then give matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95-99%    0.84   0.91   0.64   0.90   1.37
85-94%    0.07   0.05   0.25   0.76   0.81
75-84%    0.80   0.91   1.71   3.78   4.40
50-74%    8.16   10.05  25.09  40.95  42.58
any       10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95-99%    0.75   0.44   0.49   0.66
85-94%    0.05   0.08   0.49   0.61
75-84%    0.46   0.96   3.67   3.85
50-74%    10.24  27.77  41.90  44.47
all       11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95-99%    0.98   0.59   1.24   1.45
85-94%    0.09   0.22   1.34   1.37
75-84%    1.03   2.26   6.35   7.07
50-74%    12.15  34.84  49.82  51.62
all       14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

3 http://kilgray.com/products/memoq
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlap and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which provide non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.2.55) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with regard to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.2.5 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
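The operation sequence in the example above can be recovered with a standard Levenshtein alignment plus a backtrace. This is a simplified sketch of what tercom computes; the shift operation of full TER is omitted for brevity:

```python
def edit_alignment(ref, hyp):
    """Levenshtein alignment between two token lists, returning the edit cost
    and per-token operations: C (match), S (substitution), D (deletion from
    ref), I (insertion into ref). TER's shift operation is omitted."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrace to recover the operation sequence
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append('C' if ref[i - 1] == hyp[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return dp[m][n], ops[::-1]

ref = "we want to have a table near the window .".split()
hyp = "we would like a table by the window .".split()
cost, ops = edit_alignment(ref, hyp)
print(cost, ops)  # cost 4: one deletion and three substitutions
```

Both the input sentence and each TM candidate would be tokenised the same way before alignment.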
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
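The preference for deletion-heavy matches can be reproduced with a weighted edit distance; the weight values below are illustrative assumptions, not the ones used in CATaLog:

```python
def weighted_edit_cost(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Edit cost of turning a TM segment into the test segment, where deleting
    a TM token is much cheaper than inserting or substituting one."""
    m, n = len(tm_seg), len(test_seg)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + w_del       # delete all TM tokens
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + w_ins       # insert all test tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + \
                (0.0 if tm_seg[i - 1] == test_seg[j - 1] else w_sub)
            dp[i][j] = min(sub, dp[i - 1][j] + w_del, dp[i][j - 1] + w_ins)
    return dp[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# with cheap deletions, TM segment 1 is the better (lower-cost) match
print(weighted_edit_cost(tm1, test), weighted_edit_cost(tm2, test))
```

With uniform weights (all 1.0) the ranking flips, which is exactly the behaviour the paragraph above argues against.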
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
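One way to project the coloring onto the target side, given the TER-matched source positions and a GIZA++-style word alignment, is sketched below; the data structures, the placeholder target tokens, and the all-aligned-tokens-matched rule are assumptions for illustration, not CATaLog's code:

```python
def color_target(tm_source, tm_target, matched_src, alignment):
    """Mark a target token green if every source token aligned to it is a TER
    match with the input sentence, red otherwise (unaligned tokens stay red).
    `alignment` maps source index -> list of target indices."""
    colors = ['red'] * len(tm_target)
    # collect, per target index, the source indices aligned to it
    tgt_to_src = {}
    for s, tgts in alignment.items():
        for t in tgts:
            tgt_to_src.setdefault(t, []).append(s)
    for t, srcs in tgt_to_src.items():
        if all(s in matched_src for s in srcs):
            colors[t] = 'green'
    return list(zip(tm_target, colors))

# toy example: source positions 0, 4, 5 ("we", "a", "table") matched the input
tm_source = "we want to have a table near the window".split()
tm_target = "t0 t1 t2 t3 t4".split()   # placeholder target tokens
alignment = {0: [0], 1: [4], 4: [2], 5: [3], 6: [1]}
print(color_target(tm_source, tm_target, {0, 4, 5}, alignment))
```

The same coloring is computed for the source side directly from the TER operations, without the alignment projection step.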
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change . i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me
Target Matches:

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)
2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen.)
3. আপনি ভুল (Gloss: apni vul.)
4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din.)
5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the five topmost color-coded TM matches listed above, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time over all the reference sentences (i.e., source TM segments). To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists. This feature exists so that results with and without posting lists can be compared. In the ideal scenario, the TM output should be the same for both, though the time taken to produce the output will be significantly different.
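The candidate filtering described above can be sketched with a simple inverted index; the stop-word list and whitespace tokenisation here are simplified assumptions:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "and", "is", "in"}  # simplified list

def build_index(tm_sources):
    """Map each lowercased, non-stop-word surface form to the posting list of
    TM segment ids containing it; no stemming, surface forms only."""
    index = defaultdict(set)
    for seg_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(seg_id)
    return index

def candidate_segments(index, input_sentence):
    """Only TM segments sharing at least one vocabulary word with the input
    are passed on to the expensive TER similarity computation."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= index.get(word, set())
    return candidates

tm = ["we want to have a table near the window",
      "how much does it cost",
      "the flight is delayed"]
index = build_index(tm)
print(candidate_segments(index, "a table by the window"))  # {0}
```

The index is built once, offline, on the TM database; at lookup time only the posting lists of the input's vocabulary words are touched.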
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horak, Ales 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015
Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.
1 Rationale
Translation memory tools have had a great
impact on the translation industry as they
provide considerable assistance to translators
They allow translators to easily re-use
previous translations providing them with
valuable productivity gains in an industry
where there is a great demand for quality
translation delivered in the shortest possible
time However existing tools have some
shortcomings The majority of existing tools
rely on Levenshtein distance and seek to
identify matches only at the sentence level
Semantically similar segments are therefore
difficult to retrieve if the string similarity is not
high enough as are sub-segment matches
because if only part of a sentence matches the
input even if this part is an entire clause it is
unlikely that this sentence would be retrieved
(Pekar and Mitkov 2007) As a result TMs are especially useful only for highly repetitive text
types such as updated versions of technical
manuals
In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, such matches are less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.
We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke 2013).

MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, the ultimate goal of which is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
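The effect can be sketched with a toy fuzzy-match function (a hedged illustration only; commercial tools do not disclose their exact scoring): word-level Levenshtein distance normalised by the length of the longer segment.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance over token lists, via dynamic programming.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def fuzzy_match(seg_a, seg_b):
    # One common normalisation: 1 - distance / length of the longer segment.
    a, b = seg_a.split(), seg_b.split()
    return 1 - edit_distance(a, b) / max(len(a), len(b))

# (a) an input of 20 words; (b) a TM segment sharing only the first 5 words
a = " ".join(f"w{i}" for i in range(1, 21))
b = " ".join(f"w{i}" for i in [1, 2, 3, 4, 5] + list(range(21, 36)))
whole = fuzzy_match(a, b)            # 0.25 -- below any usable threshold
clause = fuzzy_match(" ".join(f"w{i}" for i in range(1, 6)),
                     " ".join(f"w{i}" for i in range(1, 6)))  # 1.0
```

Scored whole-sentence against whole-sentence, the pair gets 0.25; scored clause against clause, the shared substring is an exact match, which is precisely the gain clause splitting aims at.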
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
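Puscasu's splitter combines machine learning with linguistic rules over full morphosyntactic analysis; a drastically simplified, purely rule-based sketch of the idea is shown below. The boundary-word and finite-verb lists are invented for illustration (a real system would use a tagger), but the core rule is the same: close a clause at a conjunction or complementiser only if the material accumulated so far already contains a finite verb.

```python
# Toy finite-clause splitter. Assumption: both word lists below are
# illustrative stand-ins for real morphosyntactic analysis.
BOUNDARY_WORDS = {"that", "because", "although", "when", "while", "and", "but"}
FINITE_VERBS = {"indicated", "were", "believes", "may", "is", "was"}

def split_clauses(sentence):
    tokens = sentence.lower().split()
    clauses, current = [], []
    for tok in tokens:
        # Only split if the pending span already holds a finite verb,
        # mirroring the "clause = group of words with a finite verb" definition.
        if tok in BOUNDARY_WORDS and any(w in FINITE_VERBS for w in current):
            clauses.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        clauses.append(" ".join(current))
    return clauses

clauses = split_clauses(
    "the ministry of defense once indicated that about 20,000 soldiers "
    "were missing in the korean war")
# -> ['the ministry of defense once indicated',
#     'that about 20,000 soldiers were missing in the korean war']
```

Each returned clause then becomes its own TM segment, as in the Set A example of Table 1.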
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.
In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.
Without clause splitting:
  Input:
    the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
    - the ministry of defense once indicated
    - that about 20,000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20,000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors
Table 1: Set A example.

Without clause splitting:
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example.
5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    23.33%                 90.00%
Retrieved (minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    14.00%                 88.89%
Retrieved (minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment with more than one clause not being split at all, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    19.23                       85.31                      0.000
Minimum threshold    29.28                       86.49                      0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    11.78                       83.88                      0.000
Minimum threshold    11.78                       88.07                      0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.23                        17.21                      0.000
Minimum threshold    6.49                        30.62                      0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.71                        23.08                      0.000
Minimum threshold    16.83                       46.37                      0.000
Table 5: Paired t-test on all experiments.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    2.67%                  17.84%
Retrieved (minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    3.33%                  25.95%
Retrieved (minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching
performance in instances where the TM contains segments in which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by human translators in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows. Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters) without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
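A word-level edit-distance computation over these sentences can be sketched as follows. Note that the normalisation used here (dividing by the longer segment) is one common variant and is an assumption on our part; the percentages quoted above follow Koehn's (2010) formula, whose normalisation differs, so the exact scores do not coincide.

```python
def levenshtein(a, b):
    # Edit distance over token lists (deletions, insertions, substitutions).
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy(seg_a, seg_b):
    # Percentage match, normalised by the longer segment (an assumption).
    ta, tb = seg_a.lower().split(), seg_b.lower().split()
    return 100 * (1 - levenshtein(ta, tb) / max(len(ta), len(tb)))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
s4 = "Press Cancel to cancel your booking information"
score_12 = fuzzy(s1, s2)  # 1 substitution + 3 deletions -> 60.0 here
score_24 = fuzzy(s2, s4)  # 1 substitution over 7 words  -> about 85.7 here
```

Whatever the normalisation, the relative picture is the same: the SVC pair (1)/(2) scores far below the near-identical pair (2)/(4), even though (1) and (2) mean the same thing.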
In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This is because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Depending on the case, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
42 Paraphrasing the source translation
units
The generation of the TM that contains the para-
phrased translation units is straightforward The
architecture of the process which is summarized
in Figure 2 is performed in three pipelines
Figure 2 Pipeline of the paraphrase framework
The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables, i.e., the verb and the predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
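The transformation performed by the local grammar can be illustrated outside NooJ with a toy sketch. The mapping table, support-verb list and function name below are our own invented stand-ins: in the real system the full verb is derived from the derivational paradigms in the NooJ dictionaries, not from a hard-coded table.

```python
import re

# Hypothetical nominalization -> verb mapping; NooJ derives these
# from derivational paradigms in its electronic dictionaries.
DERIVED_VERB = {"cancellation": "cancel", "decision": "decide",
                "installation": "install"}
SUPPORT_VERBS = r"(make|makes|take|takes|give|gives)"

def paraphrase_svc(sentence):
    """Replace 'support verb (+ optional det/adj) + nominalization
    (+ optional preposition)' with the full verb,
    e.g. 'make the installation of' -> 'install'."""
    pattern = re.compile(
        SUPPORT_VERBS + r"\s+(?:\w+\s+)?(" +
        "|".join(DERIVED_VERB) + r")(?:\s+of)?")
    return pattern.sub(lambda m: DERIVED_VERB[m.group(2)], sentence)

print(paraphrase_svc("You must make the installation of the software"))
# You must install the software
```

Unlike this regex toy, the NooJ grammar also carries tense and person through the paraphrase, which is exactly what the +PR+1+s constraints above encode.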
The last stage performs the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores whenever the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%–74%). However, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%–99% category, 6% in the 85%–94% category, 28% in the 75%–84% category and, finally, 27% in the 0%–74% category (No Match + 50%–74%). This is a clear indication that the paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
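Fuzzy match scores of the kind reported above are typically computed from a word-level edit distance between the new segment and a TM segment. The exact formula used by commercial CAT tools such as Trados Studio is proprietary, so the sketch below only shows the common textbook approximation (Levenshtein distance normalized by the longer segment's length):

```python
def levenshtein(a, b):
    """Word-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(seg, tm_seg):
    """Approximate similarity in percent between two segments."""
    a, b = seg.split(), tm_seg.split()
    dist = levenshtein(a, b)
    return round(100 * (1 - dist / max(len(a), len(b))))

print(fuzzy_match("make sure that the brake hose is not twisted",
                  "ensure that the brake hose is not twisted"))  # 78
```

Under this approximation, paraphrasing "make sure" in the TM unit to match the document wording would lift the pair into a higher match category, which is the effect the table above quantifies.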
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14       48
95%–99%                   23       32
85%–94%                   18       29
75%–84%                   51       38
50%–74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, a human evaluation was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future work.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve as much as possible the translation fuzzy match level, given that the sentences have the same meaning but do not share the same lexical items.
The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and will extend the approach to other languages. Last but not least, applying the paraphrasing framework to the target sentences may improve the quality of the translations even further.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size, real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases or, in the best case, full segments.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
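The greedy joining procedure described above can be sketched as follows. This is our own simplified reconstruction in the spirit of JOINN (overlap handling, translation lookup and fluency filtering are omitted), not the authors' code:

```python
def join_subsegments(segment_tokens, subsegments):
    """Greedily join neighbouring subsegments, given as (start, end) token
    index pairs with exclusive end, preferring longer subsegments."""
    I = sorted(subsegments, key=lambda s: s[1] - s[0], reverse=True)
    results = []
    while I:
        T = []
        base = I[0]
        for other in I[1:]:
            # neighbouring spans can be concatenated into a longer span
            if base[1] == other[0]:
                T.append((base[0], other[1]))
            elif other[1] == base[0]:
                T.append((other[0], base[1]))
        if T:
            results.extend(T)
            # prepend the new (longer) subsegments and retry with the longest
            I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I[1:]
        else:
            I = I[1:]  # discard the processed subsegment
    return [" ".join(segment_tokens[s:e]) for s, e in set(results)]

toks = "the brake hose is not twisted".split()
print(join_subsegments(toks, [(0, 3), (3, 6)]))
# ['the brake hose is not twisted']
```

Joining the two neighbouring spans reconstructs the full document segment, which is the best case mentioned in Section 2.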
² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT:
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³
¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 IntroductionThe use of translation and text processing soft-ware is an important part of the translationworkflow Terminology and computer-aidedtranslation tools (CAT) are among the mostwidely used software that professional trans-lators use on a regular basis to increase theirproductivity and also improve consistency intranslation
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments, and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
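The ranking step described above can be mimicked with a plain word-level edit distance. The sketch below is only an approximation: real TER, as computed by tercom, additionally allows block shifts, which are omitted here.

```python
def edit_rate(tm_match, input_seg):
    """Word-level edit rate of a TM match against the input sentence:
    Levenshtein distance over the token sequences divided by the input
    length. Real TER (tercom) also allows block shifts, so this is
    only an approximation of the metric the paper uses."""
    h, r = tm_match.split(), input_seg.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(r) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete from TM match
                           dp[i][j - 1] + 1,        # insert missing word
                           dp[i - 1][j - 1] + sub)  # substitute or match
    return dp[len(h)][len(r)] / len(r)

tm = "we want to have a table near the window"
inp = "we would like a table by the window"
score = edit_rate(tm, inp)   # 4 edits over 8 input tokens -> 0.5
```

For this pair the four edits (one deletion, three substitutions) over the eight input tokens give an edit rate of 0.5; tercom's exact score may differ slightly because of shifts and tokenization.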
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other edit operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
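The effect described above can be sketched with a weighted edit distance in which deletions are cheap. The weights below (0.1 for deletion, 1.0 for everything else) are illustrative assumptions, not the values CATaLog actually assigns.

```python
def weighted_cost(tm_match, input_seg, del_cost=0.1, ins_cost=1.0, sub_cost=1.0):
    """Word-level edit distance with operation-specific weights.
    Deleting from the TM match is cheap, mirroring the argument that
    deleting already-highlighted words takes little post-editing effort.
    The 0.1 deletion weight is an illustrative assumption."""
    h, r = tm_match.split(), input_seg.split()
    dp = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, len(r) + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0.0 if h[i - 1] == r[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,   # cheap deletion
                           dp[i][j - 1] + ins_cost,   # costly insertion
                           dp[i - 1][j - 1] + sub)    # substitution/match
    return dp[len(h)][len(r)]

test_seg = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
# Uniform weights favour tm2 (3 insertions vs 6 deletions);
# cheap deletions flip the ranking in favour of tm1.
```

With uniform weights `tm2` costs 3.0 against 6.0 for `tm1`; with the cheap deletion weight `tm1` becomes the preferred match, as the paper argues.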
3.2 Color Coding

From the top 5 choices, the post-editor selects one reference translation to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, while red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
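Alignment lines of this kind can be parsed with a few lines of code. The sketch below assumes GIZA++'s usual `word ({ positions })` A3-file notation; real GIZA++ output also carries the target sentence and scores on neighbouring lines.

```python
import re

def parse_giza_line(line):
    """Parse a GIZA++-style alignment line such as
    'NUL ({ }) we ({ 1 }) want ({ 6 })' into (word, positions) pairs,
    where positions index words of the target sentence."""
    pairs = []
    for word, positions in re.findall(r"(\S+) \(\{([\d ]*)\}\)", line):
        pairs.append((word, [int(p) for p in positions.split()]))
    return pairs

line = ("NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) "
        "a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 })")
aligned = parse_giza_line(line)   # [('NUL', []), ('we', [1]), ...]
```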
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments green and the unmatched fragments red in the selected source sentences and their corresponding translations.
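A minimal sketch of the source-side colouring step, assuming the TER operation for each TM-match token has already been extracted; the HTML markup is purely illustrative, as the paper does not describe CATaLog's actual rendering.

```python
def color_code(tokens, ops):
    """Wrap each TM-match token in a green span if its TER operation
    is a match ('C'), red otherwise ('S' or 'D'), following the
    paper's colour scheme."""
    spans = []
    for tok, op in zip(tokens, ops):
        colour = "green" if op == "C" else "red"
        spans.append('<span style="color:%s">%s</span>' % (colour, tok))
    return " ".join(spans)

# TER operations for the running example TM match.
tokens = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
html = color_code(tokens, ops)
```

Here the five matched tokens come out green and the four edited ones red, giving the translator an at-a-glance estimate of post-editing effort.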
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a screenshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) would take a long time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results obtained with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
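The posting-list filter can be sketched with a standard inverted index. The stop-word list below is a made-up stand-in for whatever list the tool actually uses.

```python
from collections import defaultdict

# Illustrative stop-word list; the paper does not publish the one it uses.
STOPWORDS = {"the", "a", "to", "we", "of"}

def build_index(tm_segments):
    """Inverted index from surface word (lowercased, stop words removed,
    deliberately unstemmed, as described above) to TM segment ids."""
    index = defaultdict(set)
    for sid, seg in enumerate(tm_segments):
        for word in seg.lower().split():
            if word not in STOPWORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """Posting-list filter: only TM segments sharing at least one
    vocabulary word with the input are passed on to TER scoring."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
index = build_index(tm)
hits = candidates(index, "a table by the window")   # only segment 0 survives
```

Only the surviving candidates are then run through the comparatively expensive TER alignment, which is where the speed-up comes from.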
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, allowing sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).
MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.
Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.
This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.
3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.
(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35
If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
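The schematic example can be checked with a simple Levenshtein-based fuzzy score, used here as a rough stand-in for the scoring each commercial tool implements in its own way: the full sentence (b) scores only 25% against (a), while the shared clause scores 100% once the segments are split.

```python
def fuzzy_score(seg_a, seg_b):
    """Word-level fuzzy match: 100 * (1 - edit_distance / max_length).
    A rough stand-in for the Levenshtein-based scores TM tools use;
    each tool's exact formula and tokenization differ."""
    a, b = seg_a.split(), seg_b.split()
    dp = list(range(len(b) + 1))          # single-row Levenshtein DP
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (wa != wb))  # substitute/match
    return 100 * (1 - dp[len(b)] / max(len(a), len(b)))

# Sentence (a): twenty words w1..w20; sentence (b) shares only w1..w5.
sent_a = " ".join("w%d" % i for i in range(1, 21))
sent_b = " ".join("w%d" % i for i in list(range(1, 6)) + list(range(21, 36)))
clause = " ".join("w%d" % i for i in range(1, 6))
```

With the whole sentences, `fuzzy_score(sent_a, sent_b)` is 25.0, well below any default threshold; comparing the shared clause against its split TM counterpart yields 100.0, which is the effect clause splitting is meant to exploit.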
In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).
In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still yield a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.
Without clause splitting:

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
5 Results

WORDFAST                      | W/o clause splitting | W/ clause splitting
Retrieved (default threshold) | 23.33                | 90.00
Retrieved (minimum threshold) | 38.00                | 92.22

TRADOS                        | W/o clause splitting | W/ clause splitting
Retrieved (default threshold) | 14.00                | 88.89
Retrieved (minimum threshold) | 14.00                | 96.67

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that, although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause-splitting errors, such as a segment with more than one clause not being split at all, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
SET A

WORDFAST          | Mean w/o clause splitting | Mean w/ clause splitting | p-value
Default threshold | 19.23                     | 85.30888891              | 0.000
Minimum threshold | 29.28                     | 86.48888891              | 0.000

TRADOS            | Mean w/o clause splitting | Mean w/ clause splitting | p-value
Default threshold | 11.78                     | 83.87777781              | 0.000
Minimum threshold | 11.78                     | 88.07111113              | 0.000

SET B

WORDFAST          | Mean w/o clause splitting | Mean w/ clause splitting | p-value
Default threshold | 2.23                      | 17.21111111              | 0.000
Minimum threshold | 6.49                      | 30.62111112              | 0.000

TRADOS            | Mean w/o clause splitting | Mean w/ clause splitting | p-value
Default threshold | 2.71                      | 23.083                   | 0.000
Minimum threshold | 16.83                     | 46.36777779              | 0.000

Table 5: Paired t-test on all experiments
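The statistic reported above can be reproduced from the raw per-segment scores. The sketch below computes a paired t-statistic from scratch with the standard library (the paper used SPSS), including the clause-score averaging step that collapses a split input into one case; the score values are invented for illustration.

```python
import math
import statistics

def case_score(clause_scores):
    """Average the match scores of an input's clauses into one case,
    as done in the paper to keep split/unsplit results comparable."""
    return sum(clause_scores) / len(clause_scores)

def paired_t(scores_a, scores_b):
    """Paired t-statistic over per-case match scores (0 = no match):
    t = mean(d) / (sd(d) / sqrt(n)) for differences d."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical per-case scores: with vs. without clause splitting.
with_split = [case_score([90.0, 80.0]), 95.0, 100.0, 85.0]
without_split = [0.0, 40.0, 0.0, 55.0]
print(round(paired_t(with_split, without_split), 3))
```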
WORDFAST                      | W/o clause splitting | W/ clause splitting
Retrieved (default threshold) | 2.67                 | 17.84
Retrieved (minimum threshold) | 10.00                | 41.08

TRADOS                        | W/o clause splitting | W/ clause splitting
Retrieved (default threshold) | 3.33                 | 25.95
Retrieved (minimum threshold) | 36.00                | 70.67

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the corrections that a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM, so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
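A word-level edit distance of this kind can be computed with the standard dynamic-programming recurrence. The sketch below normalizes by the longer sentence's length, one common reading of Koehn's (2010) formula; as the paper notes, commercial tools use undisclosed variants, so the resulting percentage differs from one normalization to another (which is presumably why the reported 70% does not match the 60% this particular denominator yields).

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two sentences."""
    s, t = a.split(), b.split()
    prev = list(range(len(t) + 1))
    for i, sw in enumerate(s, 1):
        curr = [i]
        for j, tw in enumerate(t, 1):
            cost = 0 if sw == tw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(a, b):
    """One normalization of the fuzzy match score:
    1 - edit_distance / length of the longer sentence."""
    return 1.0 - word_edit_distance(a, b) / max(len(a.split()), len(b.split()))

s1 = "press cancel to make the cancellation of your personal information"
s2 = "press cancel to cancel your personal information"
print(word_edit_distance(s1, s2), fuzzy_match(s1, s2))
```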
In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation, given the low fuzzy match score, and then it will be post-edited. Otherwise, the translator has to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except in some cases, they appear with a bare direct object (e.g. to make attention) or with a determined direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
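A simplified reader for entries of this shape can be sketched as follows. This is only an illustrative parser for the lemma/category/property layout described above, not NooJ's own loader; real NooJ dictionaries carry further structure (derivational paradigms, translation equivalents, multi-word entries) that this sketch ignores.

```python
def parse_nooj_entry(entry):
    """Parse a NooJ-style dictionary line such as
    "artist,N+FLX=TABLE+Hum" into (lemma, category, properties)."""
    lemma, info = entry.split(",", 1)
    category, *flags = info.split("+")
    props = {}
    for flag in flags:
        if "=" in flag:
            key, value = flag.split("=", 1)   # e.g. FLX=TABLE
            props[key] = value
        else:
            props[flag] = True                # bare property, e.g. +Hum
    return lemma, category, props

print(parse_nooj_entry("artist,N+FLX=TABLE+Hum"))
```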
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected, so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
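The pattern the local grammar encodes (support verb, optional determiner, nominalization, optional preposition, rewritten as a full verb) can be approximated with a regular expression. The sketch below is not the NooJ grammar: the nominalization table is hand-made and illustrative, and unlike the grammar described above it ignores tense and person agreement.

```python
import re

# Toy nominalization lexicon; NooJ derives such mappings from its
# dictionaries, so this hand-made table is only illustrative.
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}

SVC_PATTERN = re.compile(
    r"\b(?:make|makes|made|take|takes|took|give|gives|gave)\b"
    r"(?:\s+(?:a|an|the))?"                        # optional determiner
    r"\s+(?P<nom>" + "|".join(NOMINALIZATIONS) + r")\b"
    r"(?:\s+of)?",                                 # optional preposition
    re.IGNORECASE,
)

def paraphrase_svc(sentence):
    """Replace a support-verb construction with its full-verb paraphrase."""
    return SVC_PATTERN.sub(
        lambda m: NOMINALIZATIONS[m.group("nom").lower()], sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
```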
The last stage consists of de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
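The overall effect of the pipeline on the TM can be sketched as follows: each source unit is rewritten, and any genuine paraphrase is appended as a new unit sharing the original target. The example pair and the toy rewriting function are invented for illustration; the paper plugs NooJ in as the `paraphrase` step.

```python
def expand_tm(units, paraphrase):
    """Append paraphrased source units to a TM, keeping targets untouched.

    `units` is a list of (source, target) pairs; `paraphrase` is any
    source-side rewriting function."""
    expanded = list(units)
    for source, target in units:
        rewritten = paraphrase(source)
        if rewritten != source:          # add only genuine paraphrases
            expanded.append((rewritten, target))
    return expanded

# Illustrative unit (simplified from the paper's Figure 4 style).
tm = [("You must make the installation of the software",
       "Si deve installare il software")]
toy = lambda s: s.replace("make the installation of", "install")
print(expand_tm(tm, toy))
```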
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%). However, we achieve extremely high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category | plTM | parTM
100%                 | 14   | 48
95%-99%              | 23   | 32
85%-94%              | 18   | 29
75%-84%              | 51   | 38
50%-74%              | 32   | 18
No match             | 62   | 35
Total                | 200  | 200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Segment:      Make sure that the brake hose is not twisted
TM source:    Ensure that the brake hose is not twisted
TM target:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM source: Make sure that the brake hose is not twisted

Segment:      CAUTION You must make the istallation of the version 6 of the software
TM source:    CAUTION You must install the version 6 of the software
TM target:    ATTENZIONE Si deve installare la versione 6 del software
parTM source: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences don't share the same lexical items.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs, and other supported languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even further.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS. Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources, with sizes varying from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data are used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from the subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment; alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated as the first word in the translation, be, and the second word větší as the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
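For illustration, an entry like the ones in Figure 1 can be parsed and checked for the three alignment cases. This is only a sketch: the tab-separated layout and the function names are our assumptions, not the actual Moses output format.

```python
# Sketch: parse one generated subsegment entry (fields as in Figure 1:
# source phrase, target phrase, the four Moses probabilities, and the
# word-alignment points). The tab-separated layout is an assumption.

def parse_entry(line):
    src, tgt, probs, align = line.split("\t")
    probs = [float(p) for p in probs.split()]
    points = [tuple(map(int, pt.split("-"))) for pt in align.split()]
    return src, tgt, probs, points

def alignment_properties(src, tgt, points):
    """Detect the three cases mentioned in the text:
    empty alignment, one-to-many alignment, opposite orientation."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    aligned_src = {s for s, _ in points}
    aligned_tgt = {t for _, t in points}
    empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    one_to_many = (len(points) > len(aligned_src)
                   or len(points) > len(aligned_tgt))
    # opposite orientation: some pair of alignment links crosses
    crossing = any((s1 - s2) * (t1 - t2) < 0
                   for i, (s1, t1) in enumerate(points)
                   for s2, t2 in points[i + 1:])
    return empty, one_to_many, crossing

entry = "být větší\tbe greater\t0.538 0.053 0.538 0.136\t0-0 1-1"
src, tgt, probs, points = parse_entry(entry)
print(probs)                                   # [0.538, 0.053, 0.538, 0.136]
print(alignment_properties(src, tgt, points))  # (False, False, False)
```

For the first subsegment of Figure 1 all three flags are false: both words are aligned, the alignment is one-to-one, and no links cross.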
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is denoted J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:          "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:        "následujících kategorií" (the following categories)
new subsegment:    "lze rozdělit do následujících kategorií"
its translation:   can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is denoted S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the non-overlapping combination of subsegments, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment, which it then tries to join successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments, and it discards one processed subsegment.
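The greedy loop described above can be sketched as follows, representing each subsegment by its (start, end) token span in the segment. This is a simplified reading of the algorithm, not the authors' implementation; the `seen` set is our addition to guarantee termination.

```python
# Sketch of the greedy JOIN loop over subsegment spans (start, end),
# inclusive token indices in the tokenized segment.

def can_join(a, b, overlap=True):
    """JOINO: the spans overlap; JOINN: the spans are adjacent."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 if overlap else s2 == e1 + 1

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def join_subsegments(spans, overlap=True):
    # process longest-first: the algorithm prefers long subsegments
    queue = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(queue)
    results = []
    while queue:
        current = queue.pop(0)          # one subsegment is discarded
        new = []
        for other in queue:
            if can_join(current, other, overlap):
                joined = join(current, other)
                if joined not in seen:  # our addition: avoid re-deriving
                    seen.add(joined)
                    new.append(joined)
        results.extend(new)
        queue = new + queue             # prepend newly built (longer) spans
    return results

print(join_subsegments([(0, 2), (2, 4), (5, 7)]))         # [(0, 4)]
print(join_subsegments([(0, 4), (5, 7)], overlap=False))  # [(0, 7)]
```

In the first call the spans (0, 2) and (2, 4) overlap at token 2 and are joined, while (5, 7) is not reachable by overlap; in the second, neighbouring spans are concatenated.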
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from a segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6 for an example.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg.:      Oblast dat může mít libovolný tvar
reference:        The data area may have an arbitrary shape
generated seg.:   Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filters for removing phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
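The retrieval step can be illustrated with a small sketch that ranks TM segments against an input segment and keeps the top 5. Note the simplification: a plain word-level edit distance is used here, whereas real TER additionally allows block shifts, and the tool calls the tercom package rather than code like this.

```python
# Sketch: rank TM source segments by word-level edit rate (a simplified
# stand-in for TER, which also counts block shifts) and keep the top 5.

def edit_rate(candidate, reference):
    c, r = candidate.split(), reference.split()
    # classic Levenshtein dynamic program over words
    d = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # match / substitute
    return d[len(c)][len(r)] / max(len(r), 1)

def top_matches(input_segment, tm_segments, n=5):
    # lower edit rate = higher similarity, so sort ascending
    return sorted(tm_segments, key=lambda s: edit_rate(s, input_segment))[:n]

tm = ["we want to have a table near the window",
      "how much is it",
      "where is the nearest bank"]
print(top_matches("we would like a table by the window", tm, n=1))
# ['we want to have a table near the window']
```

The first TM segment wins because it needs only one deletion and three substitutions relative to the input, while the other two segments share almost no words with it.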
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .
TM Match: we want to have a table near the window .

TER alignment:
"we","we",C,0
"want","",D,0
"to","would",S,0
"have","like",S,0
"a","a",C,0
"table","table",C,0
"near","by",S,0
"the","the",C,0
"window","window",C,0
".",".",C,0

we want to have a table near the window .
|  D    S  S    | |       S   |   |      |
we -    would like a table by  the window .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism, based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost
• TM segment 1: how much does it cost to the holiday inn by taxi
• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
38
2 would be a better match with respect to thetest segment as TM segment 2 involves insert-ing translations for 3 non-matching words inthe test segment (ldquodoes it costrdquo) as opposed todeleting translations for 6 non-matching words(ldquoto the holiday inn by taxirdquo) in case of TMsegment 1 However deletion of translationsfor the 6 non-matching words from the transla-tion of TM segment 1 which are already high-lighted red by the TM takes much less cogni-tive effort and time than inserting translationsof 3 non-matching words into the translationof TM segment 1 in this case This justifiesassigning minimal weights to the deletion op-eration which prefers TM segment 1 over TMsegment 2 for the test segment shown above
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts of each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
bull English we want to have a table nearthe window
bull Bengali আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
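A minimal sketch of combining the two sets of alignments, assuming TER labels per TM source word and a GIZA++-style source-to-target word alignment (1-based positions, taken from the Bengali example above):

```python
# Combine the TER labels on the matched TM source words with a
# source-to-target word alignment to decide which target words to show
# in green (matched) and which in red. This is a sketch of the idea;
# the real tool reads GIZA++ alignment files directly.

def color_target(source_labels, alignment, target_len):
    """source_labels: list of 'C'/'S'/'D'... labels per TM source word.
    alignment: dict mapping 1-based source position -> 1-based target positions."""
    colors = ["red"] * target_len  # default: needs post-editing
    for src_pos, label in enumerate(source_labels, start=1):
        if label == "C":           # source word matched the input sentence
            for tgt_pos in alignment.get(src_pos, []):
                colors[tgt_pos - 1] = "green"
    return colors

# TER labels for "we want to have a table near the window" against the
# input "we would like a table by the window", and the word alignment
# from the example above (7 target tokens including punctuation).
labels = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
align = {1: [1], 2: [6], 5: [4], 6: [5], 7: [3], 9: [2]}
print(color_target(labels, align, 7))
# ['green', 'green', 'red', 'green', 'green', 'red', 'red']
```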
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the green-to-red ratio. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation with more green fragments than red fragments is a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green yet still be poor candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will differ significantly.
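A minimal sketch of such an inverted index, with an assumed illustrative stop list:

```python
# Minimal inverted index (posting lists) over the TM source side: each
# vocabulary word maps to the set of sentence ids containing it, and only
# sentences sharing at least one vocabulary word with the input are
# considered for similarity measurement.

from collections import defaultdict

STOPWORDS = {"the", "a", "to", "of", "is"}  # illustrative stop list

def build_index(tm_sources):
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for word in sent.lower().split():
            if word not in STOPWORDS:
                index[word].add(sid)
    return index

def candidates(index, query):
    """Union of posting lists for the query's words: the reduced search space."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "the ministry of defense once indicated"]
index = build_index(tm)
print(sorted(candidates(index, "a table by the window")))  # [0]
```

Only sentence 0 shares a content word ("table", "window") with the query, so the TER alignment need only be computed against that one candidate.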
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining further features in the tool, for example sentence and clause segmentation using commas and semi-colons as indicators. It should also be noted that in this paper we considered only word alignments; in the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.
Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.
The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).
Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.
We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
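The Levenshtein-based match score discussed here can be sketched as below. The normalisation by the longer segment's length is an assumption for illustration; commercial tools compute the percentage in tool-specific ways:

```python
# Fuzzy match score as Levenshtein-based similarity. A long TM segment
# that contains the input as just one clause scores low against the whole
# segment, but 100% once the segment is split into its component clauses.

def levenshtein(a, b):
    """Word-level edit distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution/match
        prev = cur
    return prev[-1]

def fuzzy_score(inp, tm):
    """Similarity percentage in the 0-100 range (assumed normalisation)."""
    inp, tm = inp.split(), tm.split()
    dist = levenshtein(inp, tm)
    return 100 * (1 - dist / max(len(inp), len(tm)))

inp = "the ministry of defense once indicated"
full = ("the ministry of defense once indicated that about 20000 soldiers "
        "were missing in the korean war")
clause = "the ministry of defense once indicated"
print(fuzzy_score(inp, full))    # well below a 70% default threshold
print(fuzzy_score(inp, clause))  # 100.0
```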
It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is; in the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.
The underlined segments are the corresponding segments that should be retrieved.
In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.
Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example
5 Results
WORDFAST
                                 w/o clause splitting   w/ clause splitting
Retrieved (default threshold)           23.33%                90.00%
Retrieved (minimum threshold)           38.00%                92.22%

TRADOS
                                 w/o clause splitting   w/ clause splitting
Retrieved (default threshold)           14.00%                88.89%
Retrieved (minimum threshold)           14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
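The paired t-test on per-segment match scores can be sketched with the standard t statistic on the score differences. The scores below are made up for illustration, not taken from the experiments; `scipy.stats.ttest_rel` would also return the p-value directly:

```python
# Paired t-test on per-segment match scores: compute the t statistic on
# the differences between the two conditions (with vs without clause
# splitting). Illustrative data only.

import math
import statistics

def paired_t(scores_a, scores_b):
    """t statistic for paired samples, based on the b - a differences."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean / (sd / math.sqrt(len(diffs)))

without = [0, 40, 75, 0, 30]     # hypothetical scores without clause splitting
with_cs = [85, 90, 100, 70, 95]  # hypothetical scores with clause splitting
print(round(paired_t(without, with_cs), 2))  # 5.8
```

The resulting t value is then compared against the t distribution with n - 1 degrees of freedom to obtain the p-value, which in the experiments reported here was below 0.0001 throughout.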
SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                     85.30888891           0.000
Minimum threshold            29.28                     86.48888891           0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            11.78                     83.87777781           0.000
Minimum threshold            11.78                     88.07111113           0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.23                     17.21111111           0.000
Minimum threshold             6.49                     30.62111112           0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.71                     23.083                0.000
Minimum threshold            16.83                     46.36777779           0.000

Table 5: Paired t-test on all experiments
WORDFAST
                                 w/o clause splitting   w/ clause splitting
Retrieved (default threshold)            2.67%                17.84%
Retrieved (minimum threshold)           10.00%                41.08%

TRADOS
                                 w/o clause splitting   w/ clause splitting
Retrieved (default threshold)            3.33%                25.95%
Retrieved (minimum threshold)           36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one clause corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Those kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. However, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy match scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on the paraphrases in a database. Their approach uses greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997).
For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
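The scores above follow the standard formula fuzzy_match = 1 - edit_distance / max(|s|, |t|). A minimal sketch (the helper names are ours; as noted above, commercial tools do not disclose their exact variants):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over two sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(segment, tm_segment, chars=False):
    """Word- or character-based fuzzy match in the spirit of Koehn (2010):
    1 - edit_distance / length of the longer segment."""
    if chars:
        s, t = segment.replace(" ", ""), tm_segment.replace(" ", "")
    else:
        s, t = segment.split(), tm_segment.split()
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

# One substitution over seven words gives roughly an 86% match:
print(round(fuzzy_match("Press Cancel to cancel your personal information",
                        "Press Cancel to cancel your booking information"), 2))
```

Matches below a translator's threshold (often 70-80%) are typically discarded, which is exactly the situation the paraphrasing approach targets.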
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).
SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Apart from some exceptional cases, they appear with a direct object without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy match between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition to this, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
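For illustration, entries of this shape can be read with a few lines of Python (a simplified sketch: `parse_entry` is our hypothetical helper, not part of NooJ, and real NooJ dictionaries support richer syntax than this):

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line such as 'artist,N+FLX=TABLE+Hum'
    into lemma, category and a dict of +key[=value] properties (simplified)."""
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    category, props = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        # flag-style properties (e.g. +Hum) are stored as True
        props[key] = value or True
    return lemma, category, props

lemma, cat, props = parse_entry("artist,N+FLX=TABLE+Hum")
print(lemma, cat, props)  # artist N {'FLX': 'TABLE', 'Hum': True}
```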
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline steps.
Figure 2: Pipeline of the paraphrase framework
The first step includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006); the same tool is also used for de-tokenization in the last step.
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last step consists of the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
The resulting TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs with English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
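The overall generation loop can be sketched as follows (all names are hypothetical; `toy_paraphraser` merely stands in for the NooJ local-grammar pass, and real TMX handling, tokenization and tag preservation are omitted):

```python
def expand_tm(units, paraphrase_svc):
    """Given (source, target) pairs and a function that rewrites SVCs,
    return the original units plus paraphrased copies.  The target side
    is never touched, so every suggestion stays a human translation."""
    expanded = list(units)
    for source, target in units:
        paraphrased = paraphrase_svc(source)
        if paraphrased != source:              # an SVC was found and rewritten
            expanded.append((paraphrased, target))
    return expanded

# Toy stand-in for the NooJ local grammar:
toy_rules = {"make the cancellation of": "cancel"}
def toy_paraphraser(text):
    for svc, verb in toy_rules.items():
        text = text.replace(svc, verb)
    return text

tm = [("Press Cancel to make the cancellation of your data",
       "Premere Cancel per cancellare i propri dati")]
print(len(expand_tm(tm, toy_paraphraser)))  # 2
```

The key design point, as the paper stresses, is that the paraphrased unit reuses the existing human translation verbatim.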
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on a fuzzy match degree meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and, finally, 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the proposals without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future work.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION: You must make the installation of the version 6 of the software
TMsl:     CAUTION: You must install the version 6 of the software
TMtg:     ATTENZIONE: Si deve installare la versione 6 del software
parTMsl:  CAUTION: You must make the installation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not shared.
The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are considerably higher compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrasing framework to the target sentences may further improve the quality of the translations.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.
Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
The evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN           probabilities                alignment
být větší   be greater   0.538 0.053 0.538 0.136      0-0 1-1
být větší   be larger    0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)
within CAT systems (Planas and Furuse, 2000).
In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
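Assuming a Moses-style record layout for the rows of Figure 1 (the "|||" separators and field order are our assumption, not necessarily the exact on-disk format used here), such output can be parsed as:

```python
def parse_subsegment(line):
    """Parse one generated subsegment record in a Moses-like
    'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' format (assumed layout)."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    # inverse phrase prob, inverse lex weight, direct phrase prob, direct lex weight
    probs = [float(p) for p in probs.split()]
    # alignment points as (source_index, target_index) pairs
    align = [tuple(int(i) for i in pt.split("-")) for pt in align.split()]
    return src, tgt, probs, align

rec = parse_subsegment("být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
print(rec[2][0], rec[3])  # 0.538 [(0, 0), (1, 1)]
```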
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J
  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ
  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S
  1. SUBSTITUTE-O: the gap in the first segment is covered by an overlap with the second subsegment, see the example in Table 1; output is OS
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS
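A deliberately simplified, single-token-overlap sketch of the SUBSTITUTE-O idea (the function name and the one-token restriction are ours; the real method works with arbitrary overlaps and PBMT-scored subsegments):

```python
def substitute_overlap(segment, subsegment):
    """SUBSTITUTE-O sketch: if the last token of `subsegment` occurs in
    `segment`, splice the subsegment in so the two overlap on that token."""
    seg, sub = segment.split(), subsegment.split()
    anchor = sub[-1]                  # the overlapping token
    if anchor not in seg:
        return segment                # no overlap, nothing to do
    i = seg.index(anchor)
    # keep everything before the part the subsegment replaces
    new = seg[:max(0, i - len(sub) + 1)] + sub + seg[i + 1:]
    return " ".join(new)

print(substitute_overlap("lze rozdělit do těchto kategorií",
                         "následujících kategorií"))
# lze rozdělit do následujících kategorií
```

This reproduces the Table 1 example: the shared token "kategorií" anchors the splice, and the known translation of the subsegment then covers the changed part.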
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
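The greedy, longest-first joining strategy can be sketched over token spans (a simplification: the real algorithm also carries the subsegment translations and the lists I and T described above):

```python
def greedy_join(spans):
    """Greedy sketch of the JOIN step: token spans (start, end) covered by
    known subsegments are merged when they overlap (JOIN-O) or neighbour
    (JOIN-N), longest spans first."""
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    while work:
        base = work.pop(0)
        changed = True
        while changed:
            changed = False
            for other in list(work):
                # touching (neighbour) or overlapping spans can be joined
                if base[0] <= other[1] and other[0] <= base[1]:
                    base = (min(base[0], other[0]), max(base[1], other[1]))
                    work.remove(other)
                    changed = True
        joined.append(base)
    return sorted(joined)

# spans 0-3 and 3-5 neighbour each other; 7-9 stands alone:
print(greedy_join([(0, 3), (3, 5), (7, 9)]))  # [(0, 5), (7, 9)]
```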
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods; see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system³. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

³http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB     OJ      NJ      NS
prec       0.60    0.63    0.70    0.66
recall     0.67    0.74    0.74    0.71
F1         0.64    0.68    0.72    0.68
METEOR     0.31    0.37    0.38    0.38

DGT
prec       0.76    0.93    0.91    0.81
recall     0.78    0.86    0.88    0.85
F1         0.77    0.89    0.89    0.83
METEOR     0.40    0.50    0.51    0.45

Table 6: Error example, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³
¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators rely on on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog², which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

¹www.matecat.com
²The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is more suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

³The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
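The weighted scoring described above can be sketched as a standard dynamic-programming edit distance with a reduced deletion cost. The weights below are illustrative assumptions, not the system's actual values, and shift operations are omitted for brevity:

```python
# Weighted edit distance between a test segment and a TM segment.
# Deletions (words present in the TM match but absent from the input)
# get a low cost, since removing highlighted words is cheap for a
# post-editor. The weights are illustrative assumptions.

def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    n, m = len(test), len(tm)
    # d[i][j] = cost of converting tm[:j] into test[:i]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_ins
    for j in range(1, m + 1):
        d[0][j] = j * w_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,      # insert a new word
                          d[i][j - 1] + w_del,      # delete a TM word
                          d[i - 1][j - 1] + sub)    # match / substitute
    return d[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

# With a cheap deletion, the longer-but-covering TM segment 1 wins.
assert weighted_edit_cost(test, tm1) < weighted_edit_cost(test, tm2)
```

With equal weights, the shorter TM segment 2 would win (3 edits versus 6); lowering the deletion cost reverses the ranking, as argued above.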
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই ।

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
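The projection of source-side matches onto the target side can be sketched as follows. This is a simplified illustration under the assumption that TER provides the set of matched source positions and that the word alignment is given as a source-position-to-target-positions map; the function and variable names are ours, not CATaLog's:

```python
# Project source-side TER matches onto the target side.
# matched_src: 0-based source positions marked C (match) by TER.
# src2tgt: GIZA++-style alignment, source position -> target positions.

def color_target(tgt_tokens, matched_src, src2tgt):
    green = set()
    for s in matched_src:
        green.update(src2tgt.get(s, ()))
    # Every target token aligned to a matched source word is green,
    # everything else (including unaligned tokens) is red.
    return [(tok, "green" if i in green else "red")
            for i, tok in enumerate(tgt_tokens)]

# Toy example with an invented 4-token target sentence.
tgt = ["t0", "t1", "t2", "t3"]
src2tgt = {0: [0], 1: [2], 2: [1, 3]}
colored = color_target(tgt, matched_src={0, 2}, src2tgt=src2tgt)
# t0, t1, t3 come from matched source words 0 and 2, so they are green;
# t2 is aligned only to the unmatched source word 1, so it is red.
```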
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match with the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words from the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)
3. আপিন ভল (Gloss: apni vul.)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments would take a lot of time for all the reference sentences (i.e., the source TM segments). For improving the search efficiency we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without posting lists; in the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
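The posting-list filtering can be sketched as a plain inverted index. The stop-word list and whitespace tokenization below are placeholder assumptions:

```python
# Inverted-index filter for TM candidate retrieval (illustrative sketch).
from collections import defaultdict

STOP = {"the", "a", "to", "of"}  # placeholder stop-word list

def build_postings(tm_sources):
    # word -> set of TM sentence ids containing that word (surface form,
    # lowercased, no stemming, stop words removed), built offline once.
    postings = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:
                postings[w].add(sid)
    return postings

def candidates(query, postings):
    # Only TM sentences sharing at least one indexed word are scored,
    # so the expensive TER alignment runs on a reduced candidate set.
    ids = set()
    for w in query.lower().split():
        ids |= postings.get(w, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
postings = build_postings(tm)
# Only sentence 0 shares indexed vocabulary with the query.
assert candidates("we would like a table by the window", postings) == {0}
```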
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.
Without clause splitting:

  Input:           the ministry of defense once indicated that about 20000
                   soldiers were missing in the korean war
  Segment in TM:   the ministry of defense once indicated that about 20000
                   soldiers were missing in the korean war and that the ministry
                   of defense believes there may still be some survivors

With clause splitting:

  Input:           - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
  Segments in TM:  - the ministry of defense once indicated
                   - that about 20000 soldiers were missing in the korean war
                   - and that the ministry of defense believes
                   - there may still be some survivors

Table 1: Set A example.

Without clause splitting:

  Input:           a member of the chart-topping collective so solid crew dumped
                   a loaded pistol in an alleyway
  Segment in TM:   a member of the rap group so solid crew threw away a loaded
                   gun during a police chase southwark crown court was told
                   yesterday

With clause splitting:

  Input:           a member of the chart-topping collective so solid crew dumped
                   a loaded pistol in an alleyway
  Segments in TM:  - a member of the rap group so solid crew threw away a loaded
                   gun during a police chase
                   - southwark crown court was told yesterday

Table 2: Set B example.
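The experimental setup above (store the component clauses of each TM segment, then match the input against them) can be sketched in a few lines of Python. This is only an illustration: the authors use Puşcaşu's (2004) multilingual clause splitter and commercial TM scoring, whereas the trigger-word splitter and the word-overlap score below are simplifying assumptions.

```python
TRIGGERS = {"that", "because", "although", "which"}

def split_clauses(sentence):
    # Toy splitter: start a new clause at a trigger word, keeping "and that" together.
    # Stand-in for Puscasu's (2004) method used in the paper.
    clauses, current = [], []
    for w in sentence.split():
        if w in TRIGGERS:
            carry = [current.pop()] if current and current[-1] == "and" else []
            if current:
                clauses.append(" ".join(current))
            current = carry
        current.append(w)
    if current:
        clauses.append(" ".join(current))
    return clauses

def similarity(a, b):
    # Crude word-overlap score standing in for a TM fuzzy-match score.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa), len(wb))

def retrieve(input_segment, tm_segments, threshold=0.7):
    # Return the best TM segment clearing the threshold, else None.
    best = max(tm_segments, key=lambda s: similarity(input_segment, s))
    return best if similarity(input_segment, best) >= threshold else None

tm = split_clauses("the ministry of defense once indicated that about 20000 "
                   "soldiers were missing in the korean war and that the ministry "
                   "of defense believes there may still be some survivors")
print(retrieve("the ministry of defense once indicated", tm))
# → the ministry of defense once indicated
```

With the long TM sentence split into clauses, the short input now finds a near-exact counterpart instead of being diluted inside a much longer segment.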
5 Results
WORDFAST                         W/o clause splitting   With clause splitting
Retrieved (default threshold)    23.33%                 90.00%
Retrieved (minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   With clause splitting
Retrieved (default threshold)    14.00%                 88.89%
Retrieved (minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A.
Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When we analysed the corresponding segments that could not be retrieved even with the minimum threshold, we found that in Set A all failures were due to errors in clause splitting, more specifically the clause splitter failing to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.
Table 4: Percentage of correctly retrieved segments in Set B.
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment with more than one clause not being split at all, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
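The same paired test can be reproduced outside SPSS. The sketch below computes the paired t statistic with the conventions just described: a score of 0 for an unretrieved segment, and clause scores averaged into one value per original segment. The match scores here are invented for illustration only; the paper's actual means are in Table 5.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    # t statistic of a paired t-test: mean of the differences over its standard error.
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical match scores per original input segment (0 = correct match not retrieved).
without_split = [0.0, 62.0, 0.0, 71.0, 0.0]
# With clause splitting, a multi-clause segment yields one score per clause;
# averaging keeps the test paired, one value per original segment.
with_split = [mean(s) for s in ([88.0], [95.0, 90.0], [79.0], [100.0], [84.0, 92.0])]

t = paired_t(without_split, with_split)
print(round(t, 2))  # → -4.61
```

With five paired cases (4 degrees of freedom), |t| well above the 2.776 critical value at the 0.05 level would indicate a significant difference, mirroring the SPSS result reported above.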
SET A

WORDFAST             Mean w/o clause splitting   Mean with clause splitting   p-value
Default threshold    19.23                       85.31                        0.000
Minimum threshold    29.28                       86.49                        0.000

TRADOS               Mean w/o clause splitting   Mean with clause splitting   p-value
Default threshold    11.78                       83.88                        0.000
Minimum threshold    11.78                       88.07                        0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean with clause splitting   p-value
Default threshold    2.23                        17.21                        0.000
Minimum threshold    6.49                        30.62                        0.000

TRADOS               Mean w/o clause splitting   Mean with clause splitting   p-value
Default threshold    2.71                        23.08                        0.000
Minimum threshold    16.83                       46.37                        0.000

Table 5: Paired t-test on all experiments.
WORDFAST                         W/o clause splitting   With clause splitting
Retrieved (default threshold)    2.67%                  17.84%
Retrieved (minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   With clause splitting
Retrieved (default threshold)    3.33%                  25.95%
Retrieved (minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
It is worth mentioning that the data used in these experiments were not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment so that, on the target side, what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzik@itl.auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Such programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years and is forecast to continue growing for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning
as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
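The Levenshtein-based scoring just described can be sketched as follows: one minus the edit distance normalised by segment length. Normalising by the longer segment's length is one common convention (commercial tools do not disclose theirs), and the example sentences here are neutral stand-ins rather than the paper's.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(s1, s2):
    # Koehn-style score: 1 - edit distance / length of the longer segment.
    w1, w2 = s1.split(), s2.split()
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))

print(f"{fuzzy_match('press ok to confirm your order', 'press ok to submit your order'):.0%}")
# → 83%
```

A single word substitution in a six-word segment already costs about 17 points, which is why heavier rewording, as in sentences (1) and (2), quickly pushes the score below a usable threshold.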
In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a determiner (e.g. to make attention) and in others with one (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
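The SVC-to-full-verb rewriting that the paper implements with NooJ grammars can be illustrated with a toy lookup table. The table and the regular expression below are hand-picked assumptions for illustration; the real system derives the full verb from the predicate noun's derivational paradigm in the lexicon, and also preserves tense and person, which this sketch ignores.

```python
import re

# Hand-picked SVC -> full-verb pairs, for illustration only.
SVC_TABLE = {
    ("make", "cancellation"): "cancel",
    ("make", "decision"): "decide",
    ("take", "decision"): "decide",
    ("give", "answer"): "answer",
}

# Support verb + optional determiner + predicate noun + optional "of".
SVC_RE = re.compile(r"\b(make|take|give|have)\s+(?:(?:a|an|the)\s+)?(\w+)(?:\s+of)?\b")

def paraphrase_svc(sentence):
    def repl(m):
        full_verb = SVC_TABLE.get((m.group(1), m.group(2)))
        return full_verb if full_verb else m.group(0)  # leave unknown pairs alone
    return SVC_RE.sub(repl, sentence)

print(paraphrase_svc("press cancel to make the cancellation of your data"))
# → press cancel to cancel your data
```

Unknown verb-noun pairs are left untouched, mirroring the paper's point that only recognized SVCs are paraphrased, so the rewriting cannot corrupt ordinary verb-object combinations.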
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as: the category/part-of-speech (e.g. V for verbs, A for adjectives); one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them); one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin); one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics); and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
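Read back, an entry of the kind shown in Figure 1 is a lemma followed by a plus-separated category and feature list (the comma after the lemma is the standard NooJ .dic layout, assumed here). A few lines suffice to parse one:

```python
def parse_entry(line):
    # "artist,N+FLX=TABLE+Hum" -> lemma, category/POS, and feature list.
    lemma, rest = line.split(",", 1)
    category, *features = rest.split("+")
    return {"lemma": lemma, "category": category, "features": features}

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# → {'lemma': 'artist', 'category': 'N', 'features': ['FLX=TABLE', 'Hum']}
```

The `FLX=` feature names the inflectional paradigm, while bare features like `Hum` or `Conc` carry the semantic classes the local grammars can test for.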
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline stages.

Figure 2: Pipeline of the paraphrase framework.

The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built in the same way, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last stage performs the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
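Keeping the target side untouched, the three stages above amount to: read (source, target) pairs, paraphrase the sources, and append any genuinely new pairs to the TM. A schematic version follows; real TMs are TMX files whose tags and metadata must be preserved, and the `paraphrase` argument stands in for the NooJ stage (the toy lambda here rewrites a full verb into an SVC, as in the paper's Figure 4 example).

```python
def expand_tm(tm_units, paraphrase):
    # tm_units: list of (source, target) pairs; paraphrase: function source -> source.
    # Target sides are copied verbatim, so every added unit remains a human translation.
    expanded = list(tm_units)
    for src, tgt in tm_units:
        new_src = paraphrase(src)
        if new_src != src:  # keep only genuine paraphrases
            expanded.append((new_src, tgt))
    return expanded

# Toy stand-in for the NooJ paraphrasing stage.
toy = lambda s: s.replace("install", "make the installation of")

tm = [("you must install the software", "si deve installare il software"),
      ("press cancel", "premere cancel")]
print(len(expand_tm(tm, toy)))  # → 3
```

Because every added unit reuses an existing human target, the expanded TM can only raise fuzzy match scores; it never introduces machine-translated content.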
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%), although we also achieve high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category and, finally, 27% in the 0%-74% category (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                    14     48
95%-99%                 23     32
85%-94%                 18     29
75%-84%                 51     38
50%-74%                 32     18
No match                62     35
Total                  200    200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required in cases where the match is below 100%.
Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.
  Seg:      Make sure that the brake hose is not twisted
  TMsl:     Ensure that the brake hose is not twisted
  TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl:  Make sure that the brake hose is not twisted

  Seg:      CAUTION You must make the installation of the version 6 of the software
  TMsl:     CAUTION You must install the version 6 of the software
  TMtg:     ATTENZIONE Si deve installare la versione 6 del software
  parTMsl:  CAUTION You must make the installation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the lexical items are not.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentence may improve the quality of translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ: být větší | EN: be greater | probabilities: 0.538 0.053 0.538 0.136 | alignment: 0-0 1-1
CZ: být větší | EN: be larger  | probabilities: 0.170 0.054 0.019 0.148 | alignment: 0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translation in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
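In code, one such entry can be read from a Moses-style phrase-table line (`source ||| target ||| four scores ||| alignment`). The sketch below is illustrative, not the paper's actual pipeline code; the class and field names are ours, and diacritics are dropped from the example strings:

```python
from dataclasses import dataclass

@dataclass
class Subsegment:
    source: str
    target: str
    # inverse phrase prob., inverse lexical weight,
    # direct phrase prob., direct lexical weight
    probs: tuple
    # word alignment as (source_index, target_index) pairs
    alignment: list

def parse_phrase_table_line(line):
    """Parse one 'src ||| tgt ||| p1 p2 p3 p4 ||| 0-0 1-1' line."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")][:4]
    points = [tuple(int(i) for i in p.split("-")) for p in align.split()]
    return Subsegment(src, tgt, tuple(float(p) for p in probs.split()), points)

# The two alternative translations from Figure 1:
entries = [
    parse_phrase_table_line("byt vetsi ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"),
    parse_phrase_table_line("byt vetsi ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1"),
]

# Pick the best alternative by direct phrase translation probability (probs[2])
best = max(entries, key=lambda e: e.probs[2])
print(best.target)  # -> be greater
```

Selecting by the direct phrase translation probability is one plausible choice here; the paper only says the four scores are used to choose among alternatives.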
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two subsegments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments are created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
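As a rough illustration of a combined n-gram fluency score, the sketch below uses a toy count-based model with add-one smoothing instead of the KenLM setup the authors actually trained; the class name, corpus, and averaging scheme are our assumptions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyNgramLM:
    """Tiny count-based LM with add-one smoothing (illustration only)."""
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        self.vocab = set()
        for sent in corpus:
            toks = sent.lower().split()
            self.vocab.update(toks)
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def logprob(self, gram):
        n = len(gram)
        num = self.counts[n][gram] + 1          # add-one smoothing
        if n == 1:
            denom = sum(self.counts[1].values()) + len(self.vocab)
        else:
            denom = self.counts[n - 1][gram[:-1]] + len(self.vocab)
        return math.log(num / denom)

    def fluency(self, sentence):
        """Average per-n-gram log-probability, combined over n = 1..5."""
        toks = sentence.lower().split()
        scores = []
        for n in range(1, min(self.max_n, len(toks)) + 1):
            grams = ngrams(toks, n)
            scores.append(sum(self.logprob(g) for g in grams) / len(grams))
        return sum(scores) / len(scores)

lm = ToyNgramLM([
    "can be divided into the following categories",
    "the following categories are described below",
])
fluent = lm.fluency("divided into the following categories")
scrambled = lm.fluency("categories following the into divided")
# A fluent word order reuses n-grams seen in the corpus and scores higher.
```

A candidate combination whose score falls below its alternatives would be ranked lower or discarded.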
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
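The joining loop can be sketched as follows. This is our simplified reconstruction of the greedy longest-first behaviour described above: it merges token-index spans rather than actual subsegment strings, and the function names are ours, not the paper's:

```python
def can_join(a, b, allow_overlap):
    """Spans can be joined if they neighbour (JOINN) or also overlap (JOINO)."""
    lo, hi = (a, b) if a[0] <= b[0] else (b, a)
    if allow_overlap:
        return lo[1] >= hi[0]      # overlapping or touching spans
    return lo[1] == hi[0]          # strictly adjacent spans only

def join_subsegments(spans, allow_overlap=True):
    """Greedily merge (start, end) token spans, longest-first, until no
    join applies; newly built (longer) spans are retried first."""
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(I):
            for j, b in enumerate(I):
                if i != j and can_join(a, b, allow_overlap):
                    merged = (min(a[0], b[0]), max(a[1], b[1]))
                    I = [s for k, s in enumerate(I) if k not in (i, j)]
                    I.insert(0, merged)   # prepend the new longer span
                    changed = True
                    break
            if changed:
                break
    return I

# A 9-token segment covered by spans 0-4 and 3-6 (overlap) and 6-9 (neighbour):
print(join_subsegments([(0, 4), (3, 6), (6, 9)]))  # -> [(0, 9)]
```

In the real method each merged span also carries a concatenated target-side translation, which this index-only sketch omits.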
2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (the lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated steps for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
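The ranking step can be sketched with a plain word-level edit rate standing in for TER (real TER, as computed by tercom, also supports shift operations, which this simplification omits; all function names are ours):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (insert/delete/substitute only)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # substitute / match
    return d[m][n]

def rank_tm(query, tm_segments, top_k=5):
    """Rank TM source segments by approximate edit rate, best first."""
    def score(seg):
        h, r = query.split(), seg.split()
        return edit_distance(h, r) / max(len(r), 1)
    return sorted(tm_segments, key=score)[:top_k]

tm = ["how much does it cost to the holiday inn by taxi",
      "we want to have a table near the window",
      "how much is it"]
print(rank_tm("how much does it cost", tm)[0])  # -> how much is it
```

Note how normalising by segment length can favour short segments; Section 3.1 of the paper addresses exactly this by re-weighting the edit operations.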
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we want  to    have a table near the window .
|  D     S     S    |   |   S    |   |      |
we  -    would like a table by   the window .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer TM segment 1 over TM segment 2 for the test segment shown above.
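The effect of cheap deletions can be sketched with a toy scoring function over TER operation sequences. The weight values below are illustrative assumptions, since the paper does not publish the exact weights used in CATaLog:

```python
# Edit-operation weights: deletion is made much cheaper than the other
# operations.  The concrete values here are assumptions for illustration.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0, "H": 1.0}  # H = shift

def weighted_cost(ops):
    """Estimated post-editing cost of a TM match, given the TER edit
    operations that align it with the input sentence."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
# TM segment 1 ("how much does it cost to the holiday inn by taxi"):
# five matched words plus six deletions.
tm1_ops = ["C"] * 5 + ["D"] * 6
# TM segment 2 ("how much"): two matched words plus three insertions.
tm2_ops = ["C"] * 2 + ["I"] * 3

# Unweighted TER prefers TM segment 2 (3 edits vs. 6 edits) ...
assert sum(op != "C" for op in tm1_ops) > sum(op != "C" for op in tm2_ops)
# ... but with a low deletion weight, TM segment 1 wins (0.6 vs. 3.0).
assert weighted_cost(tm1_ops) < weighted_cost(tm2_ops)
```

With these weights the ranking flips exactly as the example in the text describes: the six cheap deletions cost less in total than the three insertions.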
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and its corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
bull English we want to have a table nearthe window
bull Bengali আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
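The way the two alignment sets combine can be sketched as follows; the TER operations and the GIZA++-style alignment dictionary correspond to the running example, and the exact projection logic in CATaLog is an assumption here:

```python
def green_target_positions(ter_ops, alignment):
    """Project source-side match information onto the target side.

    ter_ops   : one TER operation per TM-source word ("C", "S" or "D";
                insertions have no TM-source word and are skipped)
    alignment : source position -> list of aligned target positions,
                as read from a GIZA++ alignment file
    Returns the target positions to color green; the rest are red.
    """
    green = set()
    src_pos = 0
    for op in ter_ops:
        if op == "I":
            continue
        if op == "C":  # matched source word: its target words turn green
            green.update(alignment.get(src_pos, []))
        src_pos += 1
    return green

# "we want to have a table near the window ." with the GIZA++ alignment
# we->1, want->6, a->4, table->5, near->3, window->2, .->7 (1-based)
alignment = {0: [1], 1: [6], 4: [4], 5: [5], 6: [3], 8: [2], 9: [7]}
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
print(sorted(green_target_positions(ter_ops, alignment)))  # [1, 2, 4, 5, 7]
```

Deleted ("D") and substituted ("S") source words contribute no green target positions, which is why their aligned target words stay red.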
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of green to red. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation that has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature allows comparing results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
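A minimal sketch of the posting-list filter follows; the stop-word list here is a stand-in, since the actual list used by the tool is not given in the paper:

```python
from collections import defaultdict

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "are", "i", "you", "we"}

def build_index(tm_sources):
    """Inverted index: surface word (lowercased, deliberately unstemmed)
    -> set of TM sentence ids whose source side contains it."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sent):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the expensive TER comparison."""
    ids = set()
    for word in input_sent.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we would like a table by the window",
      "how much does it cost",
      "you gave me the wrong change"]
index = build_index(tm)
print(candidates(index, "i think you got the wrong number"))  # → {2}
```

Only TM sentence 2 shares a content word ("wrong") with the input, so the other two sentences are never sent to the TER scorer.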
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface named CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Trans-lation (EILMT) is a project funded by theDepartment of Information and Technology(DIT) Government of India
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.
Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.
Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.
J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.
Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)
Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.
Spence Green Jeffrey Heer and Christopher DManning 2013 The efficacy of human post-editing for language translation In Proceedingsof the SIGCHI Conference on Human Factorsin Computing Systems
Spence Green 2014 Mixed-initiative natural lan-guage translation PhD thesis Stanford Uni-versity
Ana Guerberof 2009 Productivity and qualityin the post-editing of outputs from translationmemories and machine translation LocalisationFocus 7(1)133ndash140
Ana Guerberof 2012 Productivity and Qualityin the Post-Edition of Outputs from TranslationMemories and Machine Translation PhD the-sis Rovira and Virgili University Tarragona
41
Rohit Gupta and Constantin Orăsan 2014 In-corporating Paraphrasing in Translation Mem-ory Matching and Retrieval In Proceedings ofEAMT
Rohit Gupta Constantin Orăsan MarcosZampieri Mihaela Vela and Josef van Gen-abith 2015 Can Translation Memories affordnot to use paraphrasing In Proceedings ofEAMT
Yifan He Yanjun Ma Josef van Genabith andAndy Way 2010 Bridging SMT and TM withtranslation recommendation In Proceedings ofACL pages 622ndash630
Panagiotis Kanavos and Dimitrios Kartsaklis2010 Integrating Machine Translation withTranslation Memory A Practical Approach InProceedings of the Second Joint EM+CNGLWorkshop Bringing MT to the User Researchon Integrating MT in the Translation Industry
Philipp Koehn and Jean Senellart 2010 Con-vergence of translation memory and statisticalmachine translation In Proceedings of AMTAWorkshop on MT Research and the TranslationIndustry pages 21ndash31
Elina Lagoudaki 2008 The value of machinetranslation for the professional translator InProceedings of the 8th Conference of the Associ-ation for Machine Translation in the Americaspages 262ndash269 Waikiki Hawaii
Percy Liang Ben Taskar and Dan Klein 2006Alignment by Agreement In Proceedings of theMain Conference on Human Language Technol-ogy Conference of the North American Chapterof the Association of Computational Linguisticspages 104ndash111
Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.
Franz Josef Och and Hermann Ney 2003A systematic comparison of various statisticalalignment models Computational linguistics29(1)19ndash51
Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.
Michel Simard and Atsushi Fujita 2012 APoor Manrsquos Translation Memory Using MachineTranslation Evaluation Metrics In Proceedingsof the 10th Biennial Conference of the Associ-ation for Machine Translation in the Americas(AMTA) San Diego California USA
Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127
Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009
Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium
Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331
Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B
Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.
For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
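The paired comparison can be sketched as follows. The match scores below are invented for illustration; the p-value reported in the paper would then be obtained by looking the t statistic up in a t distribution with n−1 degrees of freedom, as SPSS does internally:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(without_cs, with_cs):
    """t statistic of a paired t-test over per-segment match scores.
    Segments with no (correct) retrieval score 0; segments split into
    several clauses contribute their average clause score as one case."""
    diffs = [b - a for a, b in zip(without_cs, with_cs)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Invented match scores for six segments, without / with clause splitting:
without = [0.0, 0.0, 72.0, 0.0, 65.5, 0.0]
with_cs = [85.0, 78.5, 90.0, 0.0, 88.0, 74.0]
t = paired_t_statistic(without, with_cs)  # positive: splitting helps
```

A large positive t on many paired cases is what licenses rejecting the null hypothesis of no difference between the two conditions.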
SET A

WORDFAST              Mean                        p-value
                      W/o clause   W/ clause
                      splitting    splitting
Default threshold     19.23        85.30888891    0.000
Minimum threshold     29.28        86.48888891    0.000

TRADOS
Default threshold     11.78        83.87777781    0.000
Minimum threshold     11.78        88.07111113    0.000

SET B

WORDFAST
Default threshold      2.23        17.21111111    0.000
Minimum threshold      6.49        30.62111112    0.000

TRADOS
Default threshold      2.71        23.083         0.000
Minimum threshold     16.83        46.36777779    0.000

Table 5: Paired t-test on all experiments
6 Conclusion
Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

                               W/o clause   W/ clause
WORDFAST                       splitting    splitting
Retrieved (default threshold)   2.67        17.84
Retrieved (minimum threshold)  10.00        41.08

TRADOS
Retrieved (default threshold)   3.33        25.95
Retrieved (minimum threshold)  36.00        70.67

Table 4: Percentage of correctly retrieved segments in Set B
It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.
In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.
Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no 317471.
References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.
Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.
Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarland University, Saarbrücken. [unpublished]
Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).
Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.
Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.
Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.
Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.
Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.
Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.
Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as many human translations as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based, or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve fuzzy matching, we paraphrase the source translation units of the TM, so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment whose source is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions, and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based fuzzy match between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based fuzzy match is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information

(2) Press Cancel to cancel your personal information

(3) Premere Cancel per cancellare i propri dati personali

(4) Press Cancel to cancel your booking information
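As an illustration, a word-based fuzzy match in the spirit of the formula discussed above can be sketched as follows. This variant normalizes word-level Levenshtein distance by the longer segment's length; since commercial tools do not disclose their formulas, the exact numbers will differ from those quoted in the text:

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (deletions, insertions,
    substitutions), computed with the standard dynamic program."""
    a, b = a.split(), b.split()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # sub/match
    return d[len(a)][len(b)]

def fuzzy_match(a, b):
    """Fuzzy match score in [0, 1]: 1 - edit distance / longer length."""
    return 1 - word_edit_distance(a, b) / max(len(a.split()), len(b.split()))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
score = fuzzy_match(s1, s2)  # 1 substitution + 3 deletions over 10 words
```

Here the distance is 4 edits ("cancellation" substituted by "cancel", plus "make", "the" and "of" deleted), giving a match of 0.6 under this normalization.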
In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score, and the result will then be post-edited; otherwise, the translator has to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear either without a direct object (e.g., to pay attention) or with a direct object (e.g., to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
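The structure of such an entry (lemma, category, then +-separated properties, as in artist,N+FLX=TABLE+Hum) can be unpacked mechanically. The helper below is our own illustration, not part of NooJ:

```python
def parse_nooj_entry(entry):
    """Split a NooJ-style dictionary line into lemma, category and properties.
    Illustrative only; the real NooJ format supports more features."""
    lemma, _, info = entry.partition(",")
    category, *props = info.split("+")
    features, flags = {}, []
    for prop in props:
        if "=" in prop:
            key, value = prop.split("=", 1)
            features[key] = value   # e.g. FLX=TABLE: inflectional paradigm
        else:
            flags.append(prop)      # e.g. Hum: semantic class
    return {"lemma": lemma, "category": category,
            "features": features, "flags": flags}

print(parse_nooj_entry("artist,N+FLX=TABLE+Hum"))
# -> {'lemma': 'artist', 'category': 'N', 'features': {'FLX': 'TABLE'}, 'flags': ['Hum']}
```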
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipeline steps.
Figure 2: Pipeline of the paraphrase framework.
The first pipeline step includes the extraction of the source translation units of a given TM; the target translation units are protected so that they are not parsed by the framework. This step also includes tokenization: the English data is tokenized using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
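A toy approximation of what such a local grammar does is sketched below. The mini-lexicon and pattern are hypothetical stand-ins for NooJ's derivational dictionaries and graph grammar; they cover only a few support verbs and output only the base verb form, whereas the real grammars preserve tense and person:

```python
import re

# Hypothetical mini-lexicon: predicate noun -> full verb (NooJ derives these
# from its derivational paradigms; this hand-made dict is for illustration only)
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}

SVC_PATTERN = re.compile(
    r"\b(?:make|makes|take|takes|give|gives|have|has)\b"  # support verb
    r"\s+(?:(?:the|a|an)\s+)?"                            # optional determiner
    r"(\w+)"                                              # predicate noun
    r"(?:\s+of\b)?",                                      # optional preposition
    re.IGNORECASE)

def paraphrase_svc(sentence):
    """Replace a recognized SVC with its full-verb paraphrase (base form only)."""
    def repl(m):
        noun = m.group(1).lower()
        return NOMINALIZATIONS.get(noun, m.group(0))  # unknown noun: keep as is
    return SVC_PATTERN.sub(repl, sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# -> Press Cancel to cancel your personal information
```

This reproduces the sentence (1) to sentence (2) paraphrase from Section 3, while leaving verb-noun combinations outside the lexicon untouched.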
The last pipeline step contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.
In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category, and finally 27% in the 0%-74% category (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for the experimental data.
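As a sanity check, the 17% gain reported for exact matches follows directly from the counts in Table 1 (a small verification sketch; the band labels are ours):

```python
# Segment counts per fuzzy band, out of 200 test segments (Table 1)
plTM  = {"100": 14, "95-99": 23, "85-94": 18, "75-84": 51, "50-74": 32, "none": 62}
parTM = {"100": 48, "95-99": 32, "85-94": 29, "75-84": 38, "50-74": 18, "none": 35}

# Percentage-point change per band relative to the 200-segment test set
gain_pct = {band: 100.0 * (parTM[band] - plTM[band]) / 200 for band in plTM}
print(gain_pct["100"])  # -> 17.0
```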
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.
Despite these fuzzy match improvements, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:        Make sure that the brake hose is not twisted
TM(sl):     Ensure that the brake hose is not twisted
TM(tg):     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM(sl):  Make sure that the brake hose is not twisted

Seg:        CAUTION You must make the installation of the version 6 of the software
TM(sl):     CAUTION You must install the version 6 of the software
TM(tg):     ATTENZIONE Si deve installare la versione 6 del software
parTM(sl):  CAUTION You must make the installation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the fuzzy match level as much as possible in cases where the meaning is the same but the lexical items differ.
The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially in the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs, and other supported languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.
Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.
Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.
Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data are used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
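These three situations can be read off the alignment points programmatically. The helper names below are ours, and the classification is a simplified illustration:

```python
def parse_alignment(align_str):
    """Parse Moses-style alignment points, e.g. '0-0 1-1', into (src, tgt) pairs."""
    return [tuple(map(int, point.split("-"))) for point in align_str.split()]

def alignment_info(pairs, src_len, tgt_len):
    """Detect 1) empty alignment, 2) one-to-many, 3) opposite orientation."""
    src = {s for s, _ in pairs}
    tgt = {t for _, t in pairs}
    tgt_seq = [t for _, t in sorted(pairs)]  # target indexes in source order
    return {
        "empty": len(src) < src_len or len(tgt) < tgt_len,       # unaligned word(s)
        "one_to_many": len(pairs) > len(src) or len(pairs) > len(tgt),
        "reordered": tgt_seq != sorted(tgt_seq),                 # non-monotonic
    }

print(alignment_info(parse_alignment("0-0 1-1"), 2, 2))
# -> {'empty': False, 'one_to_many': False, 'reordered': False}
```

For the 'být větší' / 'be greater' pair of Figure 1, the alignment is monotonic and one-to-one, so none of the three flags fire.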
4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

- JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"
- SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
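The gap-filling step of SUBSTITUTE reduces to splicing token lists; a minimal sketch (the index convention and names are ours):

```python
def substitute(segment_tokens, gap, replacement_tokens):
    """Fill the uncovered gap [start, end) of a known segment with the tokens
    of another subsegment (sketch of the SUBSTITUTE idea)."""
    start, end = gap
    return segment_tokens[:start] + replacement_tokens + segment_tokens[end:]

# Mirrors the Table 1 example: 'těchto' replaced by 'následujících'
print(substitute("lze rozdělit do těchto kategorií".split(), (3, 4),
                 ["následujících"]))
# -> ['lze', 'rozdělit', 'do', 'následujících', 'kategorií']
```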
During the subsegment non-overlapping combination, the acceptability of each combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
² In current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
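The longest-first joining step of the algorithm above can be sketched as follows. This is a much-simplified illustration: the bookkeeping with lists I and T is omitted, only the neighbouring (JOINN) case is shown, and all names are ours:

```python
def join_neighbouring(spans):
    """Join token-index spans (start, end) of subsegments found in a document
    segment when one ends exactly where another begins; longest spans are
    tried first, as the paper's algorithm prefers longer subsegments."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            if e1 == s2:                  # neighbour on the right
                joined.append((s1, e2))
            elif e2 == s1:                # neighbour on the left
                joined.append((s2, e1))
    return joined

print(join_neighbouring([(0, 3), (3, 5)]))  # -> [(0, 5)]
```

In the full method, each newly produced span would be fed back into the ordered list and its translation checked with the language-model fluency score before being accepted.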
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have run and evaluated the methods also on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
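The banding used in the following tables can be written down directly (thresholds per the MemoQ-style analysis categories; a sketch):

```python
def match_band(score):
    """Map a fuzzy match score (in percent) to the band used in Tables 2-4."""
    if score == 100:
        return "100"
    for floor, band in ((95, "95-99"), (85, "85-94"), (75, "75-84"), (50, "50-74")):
        if score >= floor:
            return band
    return "no match"

print(match_band(87))  # -> 85-94
```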
Table 2: MemoQ analysis, TM coverage in %.

Match      TM      SUB     OJ      NJ      all
100%       0.41    0.12    0.10    0.17    0.46
95-99%     0.84    0.91    0.64    0.90    1.37
85-94%     0.07    0.05    0.25    0.76    0.81
75-84%     0.80    0.91    1.71    3.78    4.40
50-74%     8.16   10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match      SUB     OJ      NJ      all
100%       0.08    0.07    0.11    0.28
95-99%     0.75    0.44    0.49    0.66
85-94%     0.05    0.08    0.49    0.61
75-84%     0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match      SUB     OJ      NJ      all
100%       0.15    0.13    0.29    0.57
95-99%     0.98    0.59    1.24    1.45
85-94%     0.09    0.22    1.34    1.37
75-84%     1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language.

³ http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

Company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45
Table 6: Error example, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we achieve a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filters that remove phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³

¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

¹www.matecat.com
²The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

³The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-0.7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
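For illustration, the kind of word-level alignment shown above can be approximated with a plain edit-distance (Levenshtein) computation plus a backtrace. The sketch below is an assumption-laden stand-in, not the tercom implementation: it omits TER's shift operation and labels each position C/D/S/I like the example.

```python
def edit_alignment(hyp, ref):
    """Align hyp (TM match tokens) to ref (input tokens).

    Returns a list of (hyp_word, ref_word, op) with op in:
    C (match), S (substitution), D (deletion from hyp), I (insertion).
    """
    h, r = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (r + 1) for _ in range(h + 1)]
    for i in range(h + 1):
        dp[i][0] = i
    for j in range(r + 1):
        dp[0][j] = j
    for i in range(1, h + 1):
        for j in range(1, r + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace, preferring diagonal moves (match/substitution).
    ops, i, j = [], h, r
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append((hyp[i - 1], ref[j - 1],
                        "C" if hyp[i - 1] == ref[j - 1] else "S"))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append((hyp[i - 1], "", "D"))  # word only in the TM match
            i -= 1
        else:
            ops.append(("", ref[j - 1], "I"))  # word only in the input
            j -= 1
    return ops[::-1]

# The paper's example pair yields one deletion and three substitutions.
hyp = "we want to have a table near the window".split()
ref = "we would like a table by the window".split()
alignment = edit_alignment(hyp, ref)
```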
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
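The re-weighted scoring described above can be sketched as follows. The concrete weight values and helper names are illustrative assumptions, not the tool's actual configuration: the paper only states that deletions receive a very low weight and the other operations equal weights.

```python
# Assumed weights: deletions cheap, everything else equal (shift would
# also be 1.0 in this scheme). "C" is a match and costs nothing.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_edit_cost(ops):
    """Sum the re-weighted costs over a TER-style operation sequence."""
    return sum(WEIGHTS[op] for op in ops)

def rank_tm_matches(candidates):
    """candidates: (tm_sentence, ops) pairs; cheapest (best) match first."""
    return sorted(candidates, key=lambda c: weighted_edit_cost(c[1]))

# Test segment: "how much does it cost" (the paper's example).
seg1 = ("how much does it cost to the holiday inn by taxi",
        ["C"] * 5 + ["D"] * 6)          # 6 extra words to delete
seg2 = ("how much", ["C"] * 2 + ["I"] * 3)  # 3 missing words to insert
best = rank_tm_matches([seg1, seg2])[0][0]
```

With equal weights the shorter segment 2 would win (3 edits vs 6), but with a deletion weight of 0.1 segment 1 costs 0.6 against segment 2's 3.0, reproducing the preference argued for in the text.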
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
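The projection step can be sketched as follows: the TER labels say which source tokens of a TM match agree with the input, and a GIZA++-style word alignment carries that match/mismatch status over to the target tokens so both sides can be colored consistently. The function and the 0-indexed alignment dictionary below are illustrative assumptions, not the tool's code.

```python
def color_tokens(ter_ops, alignment, n_target):
    """Color source and target tokens of a TM match.

    ter_ops: per-source-token TER labels ("C" = match, anything else = edit).
    alignment: dict source_index -> set of target indices (GIZA++ style).
    Returns (source_colors, target_colors), "green"/"red" per token;
    unaligned target tokens stay red.
    """
    src_colors = ["green" if op == "C" else "red" for op in ter_ops]
    tgt_colors = ["red"] * n_target
    for s, targets in alignment.items():
        if src_colors[s] == "green":
            for t in targets:
                tgt_colors[t] = "green"
    return src_colors, tgt_colors

# Example based on the paper's table sentence (indices 0-based, illustrative):
# TM match "we want to have a table near the window" vs the earlier input.
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
align = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}}
src_colors, tgt_colors = color_tokens(ops, align, n_target=6)
```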
Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. First, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Second, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me
Target Matches
1. আপনি আমাকে ভুল খুচরো দিয়েছেন। আমি আশি ডলার দিয়েছি। (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন। (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপনি ভুল। (Gloss: apni vul.)

4. আপনি আমাকে টাকা দিন। (Gloss: apni amake taka din.)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন। (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the five color-coded TM matches listed above, in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and are of little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
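The posting-list filtering can be sketched with a small inverted index. The stop-word list and whitespace tokenization below are simplified assumptions; only TM sentences sharing at least one vocabulary word with the input would then be scored with TER.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "is"}  # illustrative only

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w.lower() for w in sentence.split() if w.lower() not in STOP_WORDS]

def build_inverted_index(tm_sources):
    """Map each vocabulary word to the set of TM sentence ids containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in tokenize(sentence):
            index[word].add(sid)
    return index

def candidate_segments(index, input_sentence):
    """Union of the posting lists for the input's vocabulary words."""
    candidates = set()
    for word in tokenize(input_sentence):
        candidates |= index.get(word, set())
    return candidates

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
index = build_inverted_index(tm)
```

Only the candidate set is then passed to the (expensive) TER alignment step, which is the saving the section describes.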
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
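A minimal sketch of such a batch mode follows; the file handling and log format are assumptions, and `top_matches` stands in for the TER-based ranking described in Section 3.1.

```python
def batch_translate(input_path, log_path, top_matches, k=5):
    """Write the top-k TM suggestions for each input segment to a log file.

    top_matches(segment, k) is assumed to return a ranked list of TM
    suggestions for one input segment.
    """
    with open(input_path, encoding="utf-8") as fin, \
         open(log_path, "w", encoding="utf-8") as fout:
        # One input segment per non-empty line.
        for segment in (line.strip() for line in fin if line.strip()):
            fout.write("Input: %s\n" % segment)
            for rank, match in enumerate(top_matches(segment, k), start=1):
                fout.write("  %d. %s\n" % (rank, match))
            fout.write("\n")
```

The resulting log file mirrors the "Input / Source Matches" layout shown earlier, so post-editors can work through it offline.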
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent these help a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how they actually increase the productivity of translators in terms of time saved.
Acknowledgements
We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. [PhD thesis, University of Wolverhampton]

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzik@itl.auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. Those kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up, and the translation is more consistent and efficient.
The fuzzy match level refers to all the corrections a professional translator must make for the retrieved suggestion to meet all the standards of the translation process. This effort is typically smaller than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of natural language, the fuzzy matching level for similar but not identical sentences is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can serve as the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the paraphrases share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output; they do not improve the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on paraphrases in a database. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and that a segment can contain several phrases that can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human-translated hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is thus given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items and no machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the suggestion and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
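Koehn's (2010) fuzzy match formula normalizes the Levenshtein distance by segment length: score = 1 - distance / length. The following sketch illustrates the word-based variant with a plain whitespace tokenization; since commercial tools differ in tokenization and normalization, the resulting scores need not match the percentages quoted above exactly.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(source, candidate):
    """Word-based fuzzy match score: 1 - edits / longer segment length."""
    s, c = source.split(), candidate.split()
    if not s and not c:
        return 1.0
    return 1.0 - edit_distance(s, c) / max(len(s), len(c))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2), 2))  # → 0.6
```

With this tokenization the two sentences differ by 1 substitution and 3 deletions over 10 words, giving a score of 0.6; the 70% quoted in the text reflects the tool's own counting.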
In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and would then have to be post-edited; otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly carried by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or from arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Apart from some exceptional cases, they appear with a bare noun as direct object (e.g., to make attention) or with a determined direct object (e.g., to make a reservation) (Athayde, 2001).
Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences that have the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. Moreover, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output would not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities, unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains lemmas together with a set of information: the category/part of speech (e.g., V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
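The entries above follow a simple lemma-category-properties layout. A minimal reader for this style of entry can be sketched as follows; this is an illustration of the format shown in Figure 1, not NooJ's own dictionary loader, and the field conventions are assumed from the examples.

```python
def parse_entry(line):
    """Parse a NooJ-style entry such as 'artist,N+FLX=TABLE+Hum'."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category, props = parts[0], {}
    for p in parts[1:]:
        if "=" in p:                 # key=value property, e.g. FLX=TABLE
            key, value = p.split("=", 1)
            props[key] = value
        else:                        # bare flag, e.g. +Hum
            props[p] = True
    return {"lemma": lemma, "category": category, "properties": props}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
print(entry["lemma"], entry["category"], entry["properties"])
```

Running it on the first entry yields the lemma `artist`, category `N`, inflectional paradigm `TABLE` and the semantic flag `Hum`.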
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipeline stages.

Figure 2: Pipeline of the paraphrase framework.

The first stage is the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006); the same tool is also used for the de-tokenization in the last stage.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for the identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and it generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the construction are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
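The behaviour of such a grammar can be emulated, very roughly, with a pattern over support verb + optional determiner + predicate noun (+ optional preposition). The noun-to-verb mapping below is a hypothetical stand-in for NooJ's derivational dictionary, and only the simple present is handled, mirroring the graph described above.

```python
import re

# Illustrative noun-to-verb derivation table (stand-in for the NooJ lexicon).
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"
DET = r"(?:the|a|an)"

# support verb + optional determiner + noun + optional trailing "of"
PATTERN = re.compile(rf"\b{SUPPORT_VERBS}\s+(?:{DET}\s+)?(\w+)(?:\s+of)?\b")

def paraphrase_svc(sentence):
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1))
        return verb if verb else match.group(0)  # leave non-SVCs untouched
    return PATTERN.sub(repl, sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# → Press Cancel to cancel your personal information
```

Unlike this toy regex, the NooJ grammar also carries tense and person over to the generated full verb and relies on genuine morphological analysis rather than a fixed word list.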
The last stage consists of the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
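The overall effect of the pipeline on the TM can be sketched as follows. Here `paraphrase_svc` is a placeholder for the NooJ paraphrasing stage described above, and its lookup table is purely illustrative; the point is that every paraphrasable source unit contributes one extra unit whose target side is the untouched original translation.

```python
def paraphrase_svc(source):
    """Placeholder for the NooJ step: a paraphrase, or None if no SVC."""
    table = {"Press Cancel to make the cancellation of your data":
             "Press Cancel to cancel your data"}
    return table.get(source)

def expand_tm(units):
    """units: list of (source, target) pairs; returns the expanded TM."""
    expanded = list(units)
    for source, target in units:
        para = paraphrase_svc(source)
        if para is not None and para != source:
            expanded.append((para, target))  # target side stays "as is"
    return expanded

tm = [("Press Cancel to make the cancellation of your data",
       "Premere Cancel per cancellare i propri dati"),
      ("Close the door", "Chiudere la porta")]
print(len(expand_tm(tm)))  # → 3
```

The key design point is that only the source side is rewritten: each generated unit reuses the original human translation verbatim, which is what makes the technique safe.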
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%), although we achieve high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, a human evaluation was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval but also ensures that the proposed translation is a human translation, so no machine translation errors will appear, and less post-editing is required for matches below 100%.
Although it improves fuzzy matching overall, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 further segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the installation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the installation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs and the support of other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.
References
Athayde M F 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro A 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici E and Dymetman M 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt M 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G Aygen, C Bowern and C Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong M, Cheng Y, Liu Y, Xu J, Sun M, Izuha T and Hao J 2014. Query lattice for translation retrieval. In COLING.

Gupta R and Orasan C 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He Y, Ma Y, van Genabith J and Way A 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He Y, Ma Y, Way A and van Genabith J 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S 1997. Serial computations of Levenshtein distances. In Pattern matching algorithms. Oxford University Press, Oxford.

Hodász G and Pohl G 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay M 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn P and Senellart J 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar V and Mitkov R 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov S, Barrett L, Thibaux R and Klein D 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas E and Furuse O 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott B and Barreiro A 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein M 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith J and Clark S 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang K, Zong C and Su K-Y 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev V and van Genabith J 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time they try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate from it all consistent phrases as subsegments. The subsegments and their corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where a subsegment has multiple translations. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word být of the source language is translated as the first word of the translation, be, and the second word větší as the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is denoted J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses?n=FactoredTraining.AlignWords
original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

Table 1: An example of the SUBSTITUTEO method, Czech → English.
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
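The distinction between the overlapping (O) and neighbouring (N) variants of both methods comes down to how the token spans of two subsegments relate inside the input segment. A sketch under our own conventions (end-exclusive spans; the function name is ours):

```python
# Span relations behind the four combination variants, assuming each
# subsegment is located in the tokenized input segment as a (start, end)
# token span with an exclusive end.

def relation(span_a, span_b):
    a0, a1 = span_a
    b0, b1 = span_b
    if b0 < a1 and a0 < b1:
        return "overlap"     # JOINO / SUBSTITUTEO apply here
    if b0 == a1 or a0 == b1:
        return "neighbour"   # JOINN / SUBSTITUTEN apply here
    return "gap"             # a third subsegment is needed in between

print(relation((0, 3), (2, 5)))  # overlap
print(relation((0, 3), (3, 5)))  # neighbour
```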
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
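The combined n-gram score can be illustrated with a toy stand-in. The paper scores candidates with a KenLM model trained on enTenTen; the tiny count-based model below only makes the idea concrete and is not the authors' implementation:

```python
# Toy illustration of a combined n-gram fluency score (n = 1..5) used to
# rank candidate combinations; higher = more fluent.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fluency(candidate, corpus, max_n=5):
    # collect n-gram counts for n = 1..max_n from the "training" corpus
    counts = Counter()
    for sent in corpus:
        toks = sent.split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(toks, n))
    # average smoothed log-count over all n-grams of the candidate
    toks = candidate.split()
    score, total = 0.0, 0
    for n in range(1, max_n + 1):
        for g in ngrams(toks, n):
            score += math.log(counts[g] + 1)
            total += 1
    return score / max(total, 1)

corpus = ["can be divided into the following categories"] * 3
print(fluency("divided into the following categories", corpus) >
      fluency("categories following the into divided", corpus))  # True
```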
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments, and it discards one processed subsegment.
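The loop above can be sketched as follows. This is our reconstruction of the described procedure, not the released implementation; subsegments are represented as token spans, and a span can be joined with any partner it overlaps or neighbours:

```python
# Reconstruction of the JOIN loop: spans are kept in a list I ordered by
# length; the longest is joined with every compatible partner, and each
# new, longer span is prepended so it is preferred in the next iteration.

def join_all(spans):
    I = sorted(set(spans), key=lambda s: (s[0] - s[1], s))  # longest first
    seen = set(I)
    out = []
    while I:
        head = I.pop(0)
        T = []
        for other in I:
            # join only spans that overlap or are adjacent in the segment
            if other[0] <= head[1] and head[0] <= other[1]:
                joined = (min(head[0], other[0]), max(head[1], other[1]))
                if joined not in seen:
                    seen.add(joined)
                    T.append(joined)
        out.extend(T)
        I = T + I  # freshly built longer subsegments are tried first
    return out

print(join_all([(0, 2), (1, 4), (4, 6)]))
```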
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
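The bucketing behind such coverage figures can be illustrated roughly. The published numbers come from MemoQ's pre-translation analysis; the word-overlap score and bucket boundaries below are only an illustration:

```python
# Rough sketch of per-segment coverage buckets: each document segment is
# scored against its best TM match and binned by the percentage of words
# covered. Not MemoQ's actual fuzzy-match algorithm.

def best_coverage(segment, tm_segments):
    """Best percentage of the segment's words found in any single TM segment."""
    words = segment.lower().split()
    best = 0.0
    for tm in tm_segments:
        tm_words = set(tm.lower().split())
        hit = sum(1 for w in words if w in tm_words)
        best = max(best, 100.0 * hit / len(words))
    return best

def bucket(pct):
    if pct >= 100: return "100%"
    if pct >= 95: return "95-99%"
    if pct >= 85: return "85-94%"
    if pct >= 75: return "75-84%"
    if pct >= 50: return "50-74%"
    return "no match"

tm = ["can be divided into these categories"]
print(bucket(best_coverage("can be divided into categories", tm)))  # 100%
```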
Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into ac-
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
count just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
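The "mixed" translation described above can be sketched as a greedy pass over the input tokens, where covered subsegments are replaced by their translations and everything else passes through in the source language. The phrase table below is invented for illustration:

```python
# Sketch of mixed-output construction: translated subsegments go to the
# target language, uncovered words stay in the source language "as is".

def assemble(tokens, phrase_table, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])  # translated subsegment
                i += n
                break
        else:
            out.append(tokens[i])  # untranslated: keep the source word
            i += 1
    return " ".join(out)

table = {"do techto kategorii": "into these categories"}
print(assemble("lze rozdelit do techto kategorii".split(), table))
```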
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and to improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric; it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
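The edit-rate idea can be made concrete with a minimal word-level implementation. Real TER as computed by tercom additionally allows block shifts; the dynamic-programming sketch below covers only insert, delete, and substitute:

```python
# Minimal word-level edit rate in the spirit of TER: edit distance
# (insert/delete/substitute) normalized by the reference length.

def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i                      # delete everything
    for j in range(len(r) + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
                d[i - 1][j - 1] + (h[i - 1] != r[j - 1]),  # substitution/match
            )
    return d[-1][-1] / len(r)

# 4 edits (3 substitutions + 1 deletion) over a 9-word reference
print(edit_rate("we would like a table by the window",
                "we want to have a table near the window"))
```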
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviors on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0)
("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0)
("window","window",C,0) (".",".",C,0)

we want to    have a table near the window
|  D    S     S    |   |     S    |   |
we -    would like a table by   the window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
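The weighted ranking can be sketched with per-operation costs. The concrete numbers below are illustrative; the text only states that deletion gets a very low weight:

```python
# Sketch: each TER edit operation gets its own cost, with deletion made
# much cheaper than insertion and substitution, so TM segment 1 wins.

COSTS = {"C": 0.0, "S": 1.0, "I": 1.0, "D": 0.1}

def weighted_cost(edit_ops):
    return sum(COSTS[op] for op in edit_ops)

# test segment "how much does it cost" (5 words)
seg1 = ["C"] * 5 + ["D"] * 6   # TM segment 1: delete 6 extra words
seg2 = ["C"] * 2 + ["I"] * 3   # TM segment 2: insert 3 missing words
print(weighted_cost(seg1) < weighted_cost(seg2))  # True: segment 1 preferred
```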
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
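Combining the two alignment sets can be sketched as a projection step. This is a simplified reconstruction, not the tool's internal format: TER marks each source token of the TM match as matched (C) or edited (S/D), and the word alignment carries those marks over to the target tokens:

```python
# Sketch of the color-coding step: matched (C) source tokens turn their
# aligned target tokens green; everything else stays red (to post-edit).

def colour_target(ops, alignment, tgt_len):
    """ops[i]: TER op for source token i; alignment: source index -> target indices."""
    colours = ["red"] * tgt_len
    for src, tgts in alignment.items():
        if ops[src] == "C":            # matched source token
            for t in tgts:
                colours[t] = "green"   # its aligned target tokens stay intact
    return colours

# TM match "we want to have a table near the window ." vs the running input;
# the alignment is a 0-indexed reading of the GIZA++ example (7 target tokens)
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
alignment = {0: [0], 1: [5], 2: [], 3: [], 4: [3], 5: [4],
             6: [2], 7: [], 8: [1], 9: [6]}
print(colour_target(ops, alignment, 7))
```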
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me
Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. the source TM segments) will take a lot of time. For improving the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
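The posting-list filter can be sketched as a small inverted index. The stop-word set below is illustrative; only the retrieved candidate sentences would then be scored with TER:

```python
# Sketch of the posting-list filter: an inverted index from lowercased,
# stop-word-filtered words to the TM sentences containing them.

STOP = {"the", "a", "to", "is"}

def build_index(tm_sentences):
    index = {}
    for i, sent in enumerate(tm_sentences):
        for w in set(sent.lower().split()) - STOP:
            index.setdefault(w, set()).add(i)
    return index

def candidates(query, index):
    """TM sentence ids sharing at least one vocabulary word with the query."""
    ids = set()
    for w in set(query.lower().split()) - STOP:
        ids |= index.get(w, set())
    return ids

tm = ["how much does it cost", "you pay me", "a table near the window"]
index = build_index(tm)
print(sorted(candidates("how much is a table", index)))  # [0, 2]
```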
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with the color coding scheme
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr
Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are similar in meaning but worded differently. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator, because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the resulting translation is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, the fuzzy matching level for similar but not identical sentences is sometimes too low, and therefore the translator is confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning
as Y1.

We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.
2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition to this, these techniques have an effect only on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, without counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the suggestion and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
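A word-level sketch of this kind of scoring follows. It uses the common normalization 1 − distance/length, but the exact formula and normalization differ between commercial tools, so the score it prints is illustrative rather than a reproduction of the percentages quoted above.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions
    turning token sequence a into token sequence b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_match(new_segment, tm_segment):
    """Score in [0, 1]; 1.0 means an exact match."""
    s, t = new_segment.split(), tm_segment.split()
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
# 1 substitution + 3 deletions, normalized by the 10 words of the longer sentence:
print(round(fuzzy_match(s1, s2), 2))  # → 0.6
```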
In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC built on a nominalization, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear either without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which usually are linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.
4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
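The NooJ grammar itself is graphical, but its effect can be approximated in a few lines of code. The sketch below is a deliberately simplified stand-in: the support-verb set and the noun-to-verb lexicon are toy examples, and the tense, person and mood handling that the real module gets from the OpenLogos resources is omitted.

```python
SUPPORT_VERBS = {"make", "makes", "made", "take", "takes", "give", "gives", "have", "has"}
DETERMINERS = {"a", "an", "the"}
# Hypothetical mini-lexicon mapping predicate nouns to their derived full verbs;
# the real module draws this information from the OpenLogos dictionaries.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install", "reservation": "reserve"}

def paraphrase_svcs(sentence):
    """Replace support-verb constructions (e.g. 'make the cancellation of')
    with their full-verb paraphrase (e.g. 'cancel')."""
    tokens = sentence.split()
    out, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in SUPPORT_VERBS:
            j = i + 1
            if j < len(tokens) and tokens[j].lower() in DETERMINERS:
                j += 1  # skip the optional determiner
            if j < len(tokens) and tokens[j].lower() in NOMINALIZATIONS:
                verb = NOMINALIZATIONS[tokens[j].lower()]
                j += 1
                if j < len(tokens) and tokens[j].lower() == "of":
                    j += 1  # drop the preposition introduced by the noun
                out.append(verb)
                i = j
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

print(paraphrase_svcs("Press Cancel to make the cancellation of your personal information"))
# → Press Cancel to cancel your personal information
```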
The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
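Schematically, the expansion step amounts to appending a paraphrased copy of every source unit that actually changed, leaving the target side untouched. In the sketch below, the paraphrase function is a toy stand-in for the NooJ pipeline:

```python
def expand_tm(tm_pairs, paraphrase):
    """Return the original TM plus one extra unit per paraphrasable source.

    Target sides are copied verbatim ("as is"); units whose source
    contains no SVC are not duplicated.
    """
    expanded = list(tm_pairs)
    for src, tgt in tm_pairs:
        para = paraphrase(src)
        if para != src:
            expanded.append((para, tgt))
    return expanded

# Toy stand-in for the NooJ paraphraser:
toy = lambda s: s.replace("make the cancellation of", "cancel")

tm = [
    ("Press Cancel to make the cancellation of your data",
     "Premere Cancel per cancellare i propri dati"),
    ("Restart the engine", "Riavviare il motore"),
]
print(len(expand_tm(tm, toy)))  # → 3 (2 original units + 1 paraphrase)
```

The same logic scales to the experiment in Section 5, where 1,025 original units plus 587 paraphrases yield a 1,612-unit TM.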
5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before, in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.

In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 [1].
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%–74%). However, we achieve high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%–99%, 6% in category 85%–94%, 28% in category 75%–84% and, finally, 27% in category 0%–74% (No Match + 50%–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
[1] http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95%–99%                  23      32
85%–94%                  18      29
75%–84%                  51      38
50%–74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.
While there are some drops in terms of fuzzy match improvement, our system presents only a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.
Segment:      Make sure that the brake hose is not twisted
TM source:    Ensure that the brake hose is not twisted
TM target:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM source: Make sure that the brake hose is not twisted

Segment:      CAUTION You must make the istallation of the version 6 of the software
TM source:    CAUTION You must install the version 6 of the software
TM target:    ATTENZIONE Si deve installare la versione 6 del software
parTM source: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences don't share the same lexical items.

The strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore paraphrasing of other support verbs and other supported languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
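As a concrete illustration of how the four scores might be used, the sketch below picks the candidate with the highest product of its scores. This combination rule is our own assumption for illustration (the paper does not state the exact selection rule), and `best_translation` is a hypothetical helper name, not the authors' code.

```python
import math

# Hedged sketch (our assumption, not the paper's stated rule): pick among
# alternative translations of a subsegment by the product of the four
# Moses scores.
def best_translation(options):
    """options: list of (translation, [p1, p2, p3, p4]) pairs."""
    return max(options, key=lambda option: math.prod(option[1]))[0]

# The two alternatives for the subsegment from Figure 1.
options = [("be greater", [0.538, 0.053, 0.538, 0.136]),
           ("be larger",  [0.170, 0.054, 0.019, 0.148])]
print(best_translation(options))  # prints "be greater"
```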
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
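These three properties can be read off the alignment points mechanically. The following sketch is our reconstruction, with `alignment_properties` a hypothetical helper operating on (source index, target index) pairs:

```python
# Sketch (our reconstruction, not the authors' code): detecting the three
# alignment properties from a list of (source, target) alignment points.
def alignment_properties(points, src_len, tgt_len):
    src_covered = {s for s, _ in points}
    tgt_covered = {t for _, t in points}
    fan_out = {}
    for s, t in points:
        fan_out.setdefault(s, []).append(t)
    return {
        # 1) empty alignment: a word on either side has no counterpart
        "empty": len(src_covered) < src_len or len(tgt_covered) < tgt_len,
        # 2) one-to-many: one source word maps to several target words
        "one_to_many": any(len(ts) > 1 for ts in fan_out.values()),
        # 3) opposite orientation: two alignment points cross (reordering)
        "reordered": any(s1 < s2 and t1 > t2
                         for s1, t1 in points for s2, t2 in points),
    }

# "0-0 1-1" from Figure 1: fully aligned, one-to-one, monotone.
print(alignment_properties([(0, 0), (1, 1)], 2, 2))
```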
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOIN_O: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOIN_N: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE_O method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTE_O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTE_N: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
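A combined n-gram fluency score of this kind could be sketched as below. The paper does not specify the exact combination, so the simple averaging here is an illustrative assumption, and `lm_logprob` stands in for a real language model query (e.g. a KenLM lookup):

```python
# Illustrative sketch of a combined n-gram fluency score for n = 1..5.
# The averaging scheme is our assumption; lm_logprob(ngram) is assumed to
# return the log-probability of a token tuple under a target-language model.
def combined_ngram_score(tokens, lm_logprob, max_n=5):
    per_n_scores = []
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:  # skip orders longer than the candidate itself
            per_n_scores.append(sum(lm_logprob(g) for g in ngrams) / len(ngrams))
    return sum(per_n_scores) / len(per_n_scores)
```

Candidate combinations would then be ranked by this score, with higher values indicating more fluent target-language output.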
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
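The greedy loop described above can be sketched on (start, end) token spans, where joining succeeds when two spans overlap or neighbour each other. This is our reconstruction of the described behaviour, not the authors' implementation:

```python
# Our reconstruction of the greedy JOIN loop (not the authors' code).
# Subsegments are (start, end) token spans with exclusive end; two spans
# can be joined when they neighbour or overlap in the segment.
def join_spans(spans):
    todo = sorted(set(spans), key=lambda s: (s[0] - s[1], s))  # longest first
    seen = set(todo)
    joined = []
    while todo:
        head, rest = todo[0], todo[1:]
        new = []
        for span in rest:
            if head[1] >= span[0] and span[1] >= head[0]:  # touch or overlap
                merged = (min(head[0], span[0]), max(head[1], span[1]))
                if merged not in seen:
                    seen.add(merged)
                    new.append(merged)
        joined.extend(new)
        # newly built, longer spans are processed before the remaining ones,
        # and the processed head span is discarded
        todo = sorted(new, key=lambda s: (s[0] - s[1], s)) + rest
    return joined

# spans over tokens 0-1, 2-4 and 4-6 of a segment yield three longer spans
print(sorted(join_spans([(0, 2), (2, 5), (4, 7)])))
```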
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches

company in-house translation memory
feature  SUB    OJ     NJ     NS
prec     0.60   0.63   0.70   0.66
recall   0.67   0.74   0.74   0.71
F1       0.64   0.68   0.72   0.68
METEOR   0.31   0.37   0.38   0.38

DGT
prec     0.76   0.93   0.91   0.81
recall   0.78   0.86   0.88   0.85
F1       0.77   0.89   0.89   0.83
METEOR   0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English

source seg.     Oblast dat může mít libovolný tvar
reference       The data area may have an arbitrary shape
generated seg.  Area data may have any shape
account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering which removes phrases that would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post-Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language-pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TMs, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
"we","we",C,0
"want","",D,0
"to","would",S,0
"have","like",S,0
"a","a",C,0
"table","table",C,0
"near","by",S,0
"the","the",C,0
"window","window",C,0
".",".",C,0

we want to    have a table near the window .
|  D    S     S    |   |    S    |    |    |
we  -   would like a table  by  the window .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be the better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
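This weighting can be illustrated with a token-level Levenshtein computation in which deletion is cheap relative to insertion and substitution. The concrete weights below are illustrative assumptions, not the paper's actual values:

```python
# Toy sketch of the weighted edit-cost ranking described above: token-level
# Levenshtein distance where deleting a TM word absent from the input is
# cheap relative to insertion and substitution. Weights are illustrative.
def weighted_edit_cost(input_toks, tm_toks, w_ins=1.0, w_del=0.1, w_sub=1.0):
    n, m = len(input_toks), len(tm_toks)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_ins        # input word not in TM: insert a translation
    for j in range(1, m + 1):
        d[0][j] = j * w_del        # extra TM word: delete its translation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if input_toks[i - 1] == tm_toks[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,      # insertion
                          d[i][j - 1] + w_del,      # deletion
                          d[i - 1][j - 1] + sub)    # match / substitution
    return d[n][m]

test = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()
# with cheap deletions, segment 1 (6 deletions) outranks segment 2 (3 insertions)
assert weighted_edit_cost(test, seg1) < weighted_edit_cost(test, seg2)
```

With equal weights the ranking flips, matching the argument in the text: segment 2 then costs 3 edits against segment 1's 6.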
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment and a red portion implies a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
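Composing the two alignment sets can be sketched as follows: TER tells us which TM source tokens match the input, and the GIZA++ links project that information onto the target tokens. The function name and data shapes below are our assumptions for illustration, not the tool's actual interface:

```python
# Sketch (names and data shapes are assumptions): combine TER edit ops over
# the TM source tokens with GIZA++ source-target links to color target tokens.
def color_target(ter_ops, src_tgt_links, tgt_len):
    """ter_ops: one edit op per TM source token, 'C' meaning a match.
    src_tgt_links: (source index, target index) word-alignment pairs."""
    matched_src = {j for j, op in enumerate(ter_ops) if op == "C"}
    colors = ["red"] * tgt_len          # unaligned target words stay red
    for src_j, tgt_k in src_tgt_links:
        if src_j in matched_src:
            colors[tgt_k] = "green"     # translation of a matched source word
    return colors

# source token 0 matched, token 1 deleted, token 2 substituted, token 3 matched
print(color_target(["C", "D", "S", "C"], [(0, 1), (2, 0), (3, 2)], 3))
# prints ['red', 'green', 'green']
```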
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number

3 you are wrong

4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the five topmost color-coded TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In an ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
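The posting-list filtering described above can be sketched as follows (a simplified stand-in for the tool's implementation; the stop-word list is illustrative):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are"}  # illustrative

def build_posting_lists(tm_sources):
    """Map each lowercased, unstemmed content word to the ids of the
    TM source sentences containing it (the inverted index)."""
    postings = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for tok in sent.lower().split():
            if tok not in STOP_WORDS:
                postings[tok].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Only TM sentences sharing at least one vocabulary word with the
    input are compared, shrinking the TER search space."""
    cands = set()
    for tok in input_sentence.lower().split():
        cands |= postings.get(tok, set())
    return cands

tm = ["you gave me the wrong change", "you are wrong", "i like tea"]
postings = build_posting_lists(tm)
print(candidate_sentences(postings, "you gave me wrong number"))
```

Only the candidate set then needs the comparatively expensive TER computation; the rest of the TM is skipped entirely.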
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme.
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
42
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.
Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki Greece
chatzik@itl.auth.gr
Abstract
Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.
1 Introduction
The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human-translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).
Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.
The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.
This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.
The rest of the paper is organized as follows: Section 2 discusses past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
25
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information

(2) Press Cancel to cancel your personal information

(3) Premere Cancel per cancellare i propri dati personali

(4) Press Cancel to cancel your booking information
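For illustration, a word-level Levenshtein distance and one common fuzzy-match normalization (1 minus distance over the longer segment's length) can be computed as below. Note this is only one plausible variant: commercial tools, as noted above, do not disclose their formulas, and the 70% figure reported in the text implies a different normalization than the one sketched here.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(s, t):
    """One common variant: 1 - distance / length of the longer segment."""
    s, t = s.lower().split(), t.lower().split()
    return 1 - edit_distance(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(edit_distance(s1.lower().split(), s2.lower().split()))  # 1 sub + 3 del
print(fuzzy_match(s1, s2))
```

On sentences (1) and (2), this recovers the 1 substitution and 3 deletions discussed in the text.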
In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except in some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
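Each entry is a lemma, a category, and a list of +-separated properties (the comma between lemma and category is assumed from NooJ's dictionary format; it is lost in the extracted text). A toy parser makes the structure explicit:

```python
def parse_nooj_entry(entry):
    """Split a NooJ-style dictionary line 'lemma,CAT+KEY=VAL+FLAG...'
    into (lemma, category, properties). KEY=VAL pairs (e.g. the FLX
    inflectional paradigm) become dict entries; bare flags (e.g. Hum)
    become boolean markers."""
    lemma, rest = entry.split(',', 1)
    fields = rest.split('+')
    props = {}
    for f in fields[1:]:
        if '=' in f:
            k, v = f.split('=', 1)
            props[k] = v
        else:
            props[f] = True
    return lemma, fields[0], props

print(parse_nooj_entry("artist,N+FLX=TABLE+Hum"))
```

This reading is a reconstruction from the entries shown; NooJ's actual format supports more field types than this sketch handles.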
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs.
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units. This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
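The three pipelines can be caricatured in a few lines of code, with a hand-made nominalization table standing in for the OpenLogos/NooJ resources and a single pattern standing in for the local grammar. This is a drastic simplification for illustration only, not the actual NooJ-based implementation:

```python
import re

# Toy stand-ins for the NooJ resources: predicate noun -> full verb.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = {"make", "makes", "take", "takes", "have", "has",
                 "give", "gives"}

def paraphrase_svc(sentence):
    """Rewrite 'support verb + (det) + predicate noun (+ of)' as the
    full verb, e.g. 'make the cancellation of' -> 'cancel'.
    Simple-present sketch only; no tense/person agreement."""
    pattern = re.compile(
        r'\b(%s)\s+(?:\w+\s+)?(%s)\b(?:\s+of)?' %
        ('|'.join(SUPPORT_VERBS), '|'.join(NOMINALIZATIONS)), re.I)
    return pattern.sub(lambda m: NOMINALIZATIONS[m.group(2).lower()],
                       sentence)

def augment_tm(units):
    """Append a paraphrased copy of each (source, target) unit whose
    source contained an SVC; target sides are reused unchanged."""
    out = list(units)
    for src, tgt in units:
        para = paraphrase_svc(src)
        if para != src:
            out.append((para, tgt))
    return out

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(augment_tm(tm)[-1][0])
```

The key design point survives the simplification: only the source side is rewritten, so every retrieved target remains a human translation.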
5 Experimental results
The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on the degree of fuzzy match, that at least meet the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%). However, we achieve high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category | plTM | parTM
100%                 |  14  |  48
95%-99%              |  23  |  32
85%-94%              |  18  |  29
75%-84%              |  51  |  38
50%-74%              |  32  |  18
No Match             |  62  |  35
Total                | 200  | 200

Table 1: Statistics for experimental data.
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in cases of matches below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or complex syntax. This is one of our approach's weaknesses and will be addressed in future projects.
Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are abundant compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and other supported languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities             alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.
3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
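For illustration, the handling of such entries can be sketched as follows. This is a hypothetical helper, not the authors' code; it assumes Moses-style `src ||| tgt ||| scores ||| alignment` lines and, as one plausible choice, ranks alternative translations by the direct phrase translation probability (the third score).

```python
# Minimal sketch (assumed helper, not the paper's implementation): parse
# Moses-style subsegment entries like those in Figure 1 and pick the best
# alternative translation by the direct phrase translation probability.

def parse_entry(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' into a dict."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    p1, p2, p3, p4 = map(float, probs.split())
    points = [tuple(map(int, p.split("-"))) for p in align.split()]
    return {"src": src, "tgt": tgt, "probs": (p1, p2, p3, p4), "align": points}

def best_translation(entries):
    """Among alternative translations of one subsegment, keep the one with
    the highest direct phrase translation probability (third score)."""
    return max(entries, key=lambda e: e["probs"][2])

entries = [
    parse_entry("být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"),
    parse_entry("být větší ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1"),
]
print(best_translation(entries)["tgt"])  # be greater
```

The alignment points parsed here are exactly what the three cases above (empty, one-to-many, reordered) are read off from.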
4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
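The JOINO/JOINN distinction can be sketched as follows (a minimal sketch with an assumed helper name, not the authors' implementation), treating each subsegment occurrence as a half-open token span in the input segment:

```python
# Sketch of the JOINO/JOINN conditions (assumed helper, not the paper's
# code). A subsegment occurrence is a half-open token span (start, end)
# within the input document segment.

def try_join(a, b):
    """Join two spans if they overlap (JOINO) or neighbour (JOINN).
    Returns (kind, joined_span), or None when they cannot be joined."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 < e1:                # spans share at least one token
        return "JOINO", (s1, max(e1, e2))
    if s2 == e1:               # spans are adjacent with no gap
        return "JOINN", (s1, e2)
    return None                # a gap remains: JOIN does not apply

print(try_join((0, 4), (2, 7)))  # ('JOINO', (0, 7))
print(try_join((0, 4), (4, 7)))  # ('JOINN', (0, 7))
print(try_join((0, 4), (5, 7)))  # None
```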
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
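A minimal sketch of the SUBSTITUTEO idea (assumed logic and helper name, not the paper's code): a candidate subsegment whose tail overlaps the known context after the gap is used to cover the gap, reproducing the Table 1 example.

```python
# Sketch of SUBSTITUTEO (assumed logic, not the paper's implementation):
# a candidate subsegment that overlaps the known context to the right of a
# gap is substituted in, replacing the unknown words in the gap.

def substitute_overlap(tokens, gap, candidate):
    """tokens: input segment; gap: (start, end) span of unknown words;
    candidate: subsegment whose tail must overlap the tokens after the gap."""
    start, end = gap
    right_context = tokens[end:]
    # find the longest overlap between the candidate tail and right context
    for k in range(min(len(candidate), len(right_context)), 0, -1):
        if candidate[-k:] == right_context[:k]:
            return tokens[:start] + candidate + right_context[k:]
    return None  # no overlap: SUBSTITUTEO does not apply

seg = "lze rozdělit do těchto kategorií".split()
new = substitute_overlap(seg, (3, 4), "následujících kategorií".split())
print(" ".join(new))  # lze rozdělit do následujících kategorií
```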
During non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.

2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
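The combined n-gram fluency score can be illustrated with a toy counts-based model standing in for the KenLM model used in the paper (function names and the exact combination of scores are assumptions):

```python
# Toy illustration (assumed scoring, standing in for the KenLM model used
# in the paper): rank candidate combined segments by a combined n-gram
# score, here the total count of their n-grams (n = 1..5) in a tiny
# "target corpus".

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(corpus, max_n=5):
    counts = Counter()
    for sent in corpus:
        toks = sent.split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(toks, n))
    return counts

def fluency(candidate, counts, max_n=5):
    toks = candidate.split()
    return sum(counts[g] for n in range(1, max_n + 1) for g in ngrams(toks, n))

corpus = ["the brake hose is not twisted", "the hose is new"]
counts = build_counts(corpus)
good = "the brake hose is not twisted"
bad = "hose the brake twisted not is"
print(fluency(good, counts) > fluency(bad, counts))  # True
```

A scrambled candidate shares only unigrams with the corpus, so the fluent ordering wins; a real system would use smoothed log-probabilities from KenLM instead of raw counts.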
5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT:
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated procedures for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
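The color coding can be sketched as follows (an illustrative helper, not CATaLog's actual rendering code), marking each token of a TM suggestion green when its edit label is a match and red otherwise:

```python
# Illustrative sketch (not CATaLog's code): render a TM suggestion with
# ANSI colors, green for tokens matched in the input segment ('C' labels)
# and red for tokens that need post-editing (substitutions, deletions).

GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def colorize(tokens, ops):
    """tokens: TM suggestion; ops: one edit-op label per token,
    'C' = correct (keep as-is), anything else = needs editing."""
    out = []
    for tok, op in zip(tokens, ops):
        color = GREEN if op == "C" else RED
        out.append(color + tok + RESET)
    return " ".join(out)

tm = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]  # hypothetical labels
print(colorize(tm, ops))
```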
31 Finding Similarity
For finding out the similar and dissimilar partsbetween the test segment and a matching TMsegment we use TER alignments TER is anerror metric and it gives an edit ratio (often re-ferred to as edit rate or error rate) in terms ofhow much editing is required to convert a sen-tence into another with respect to the lengthof the first sentence Allowable edit operationsinclude insert delete substitute and shift Weuse the TER metric (using tercom-72515) tofind the edit rate between a test sentence andthe TM reference sentences
Simard and Fujita (2012) first proposed theuse of MT evaluation metrics as similarityfunctions in implementing TM functionalityThey experimented with several MT evalua-tion metrics viz BLEU NIST Meteor andTER and studied their behaviors on TM per-formance In the TM tool presented here weuse TER as the similarity metric as it is veryfast and lightweight and it directly mimicsthe human post-editing effort Moreover thetercom-7251 package also produces the align-ments between the sentence pair from which itis very easy to identify which portions in thematching segment match with the input sen-tence and which portions need to be workedon Given below are an input sentence a TMmatch and the TER alignment between themwhere C represents a match (shown as the ver-tical bar lsquo|rsquo) and I D and S represents thethree post-editing actions - insertion deletionand substitution respectively
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)
we want to    have  a table near the window
|  D    S     S     | |     S    |   |
we -    would like  a table by   the window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. In post-editing, however, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment: TM segment 2 involves inserting translations for the 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for the 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words in this case. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
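This preference can be reproduced with a weighted edit distance in which deletions are cheap. The sketch below uses illustrative weights (0.1 for deletion, 1.0 otherwise), not the actual values used in CATaLog:

```python
def weighted_edit_cost(inp, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Weighted cost of editing the TM segment into the input segment.
    Deleting a TM word is cheap, reflecting that post-editors delete
    faster than they insert or rewrite. Weights are illustrative."""
    a, b = tm.split(), inp.split()
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + w_del        # delete remaining TM words
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + w_ins        # insert remaining input words
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w_sub
            dp[i][j] = min(dp[i - 1][j] + w_del,
                           dp[i][j - 1] + w_ins,
                           dp[i - 1][j - 1] + sub)
    return dp[-1][-1]
```

With these weights, TM segment 1 costs 6 x 0.1 = 0.6 (six deletions) while TM segment 2 costs 3 x 1.0 = 3.0 (three insertions), so segment 1 ranks first; with uniform weights the ranking flips.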
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।
• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
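The projection of the two alignment sets onto colors can be sketched as follows, assuming GIZA++'s usual output convention where each source token is followed by its 1-based target positions in "({ ... })" brackets (the function names are ours):

```python
import re

def parse_giza_line(line):
    """Parse a GIZA++-style alignment line, e.g.
    'NULL ({ }) we ({ 1 }) want ({ 6 })', into a dict mapping
    1-based source positions to lists of 1-based target positions."""
    links = {}
    for pos, m in enumerate(re.finditer(r'(\S+) \(\{([\d ]*)\}\)', line)):
        if m.group(1) != 'NULL':          # position 0 is the NULL word
            links[pos] = [int(t) for t in m.group(2).split()]
    return links

def color_target(matched_src, links, tgt_len):
    """Color each target word green if it is aligned to a source word
    that the TER alignment marked as matched, red otherwise."""
    colors = ['red'] * tgt_len
    for src, tgts in links.items():
        if src in matched_src:
            for t in tgts:
                colors[t - 1] = 'green'
    return colors
```

For the example above, with TER-matched source positions {1, 5, 6, 8, 9, 10} (we, a, table, the, window, and the final period), the seven target tokens come out green except those aligned to "want" and "near".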
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of green to red. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপনি আমাকে ভুল খুচরো দিয়েছেন । আমি আশি ডলার দিয়েছি । (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2 আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন । (Gloss: amar dharona apni vul nombore phon korechen.)

3 আপনি ভুল । (Gloss: apni vul.)

4 আপনি আমাকে টাকা দিন । (Gloss: apni amake taka din.)

5 আপনি আমার কাছে থেকে বেশি নিচ্ছেন । (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the 5 topmost color-coded TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use or bypass these posting lists; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce it will be significantly different.
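A minimal in-memory sketch of this filtering (the stop-word list and helper names are illustrative, not taken from CATaLog):

```python
from collections import defaultdict

STOP = {'the', 'a', 'an', 'to', 'of', 'is'}  # illustrative stop-word list

def build_index(tm_sources):
    """Inverted index: lowercased surface form -> set of TM sentence ids.
    No stemming is applied, so only exact surface forms are retrieved."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:
                index[w].add(sid)
    return index

def candidates(index, query):
    """TM sentence ids sharing at least one content word with the query;
    only these go on to the (expensive) TER comparison."""
    ids = set()
    for w in query.lower().split():
        ids |= index.get(w, set())
    return ids
```

Only the returned candidate ids are passed to the TER comparison; with the option disabled, every TM sentence would be compared.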
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example, sentence and clause segmentation using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horak, Ales 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36
with Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using centralized human evaluations.
The rest of the paper is organized as follows. Section 2 discusses past related work; Section 3 the theoretical background; Section 4 the conceptual background as well as the architecture of the framework; Section 5 details the experimental results; and Section 6 the plans for further work.
2 Related work
There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.
A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based, or example-based) trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic, and lexical generalization.
Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.
Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions, and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based fuzzy match between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based fuzzy match is 76% (14 deletions for 60 characters, without counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
(1) Press Cancel to make the cancellation of your personal information

(2) Press Cancel to cancel your personal information

(3) Premere Cancel per cancellare i propri dati personali

(4) Press Cancel to cancel your booking information
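A word-level fuzzy match in this style can be sketched as follows (a simplification of Koehn's (2010) formulation, dividing the edit distance by the longer sentence's length; on the shortened example sentences printed here it yields 60% rather than the 70% computed over the full 13-word originals):

```python
def fuzzy_match(candidate, tm_source):
    """Word-based fuzzy match score as a percentage:
    100 * (1 - edit_distance / length of the longer sentence)."""
    a, b = candidate.split(), tm_source.split()
    prev = list(range(len(b) + 1))
    for i, aw in enumerate(a, 1):
        cur = [i]
        for j, bw in enumerate(b, 1):
            cost = 0 if aw == bw else 1
            cur.append(min(prev[j] + 1,          # delete a word of a
                           cur[-1] + 1,          # delete a word of b
                           prev[j - 1] + cost))  # substitute (1) or match (0)
        prev = cur
    return 100 * (1 - prev[-1] / max(len(a), len(b)))
```

On sentences (1) and (2) above the edit distance is 4 (1 substitution, 3 deletions), giving a 60% match for the 10-word sentence (1).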
In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly carried by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003). SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; apart from some exceptional cases, they appear with a direct object (e.g., to make attention or to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background
As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as: the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.); one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.); one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.); one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics); and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
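Entries of this form are easy to process mechanically; a sketch assuming the standard comma-separated NooJ dictionary layout (the helper name is ours):

```python
def parse_nooj_entry(entry):
    """Split a NooJ-style dictionary line 'lemma,CAT+prop[+prop=value...]'
    into (lemma, category, properties). Simplified: real NooJ entries may
    also carry comments, multiword lemmas, and escaped commas."""
    lemma, rest = entry.split(',', 1)
    category, *raw_props = rest.split('+')
    props = {}
    for p in raw_props:
        key, _, val = p.partition('=')
        props[key] = val or True      # flag properties map to True
    return lemma, category, props
```

For example, the first entry in Figure 1 yields the lemma "artist", category N, the inflectional paradigm TABLE, and the semantic flag Hum.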
4.2 Paraphrasing the source translation units
The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
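Assuming the TM is exported in the standard TMX interchange format, the extraction step can be sketched with the XML parser from the standard library (the sample data and function name are ours):

```python
import xml.etree.ElementTree as ET

# Minimal illustrative TMX fragment with one EN-IT translation unit.
TMX = """<tmx version="1.4"><body>
<tu>
  <tuv xml:lang="en"><seg>Press Cancel to make the cancellation of your personal information</seg></tuv>
  <tuv xml:lang="it"><seg>Premere Cancel per cancellare i propri dati personali</seg></tuv>
</tu>
</body></tmx>"""

def extract_source_units(tmx_text, src_lang='en'):
    """Pull only the source-language segments out of a TMX export;
    the target <tuv> elements are left untouched, mirroring the
    'protect the target side' step described above."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter('tu'):
        for tuv in tu.iter('tuv'):
            lang = tuv.get('{http://www.w3.org/XML/1998/namespace}lang')
            if lang == src_lang:
                units.append(tuv.findtext('seg'))
    return units
```

The extracted English segments are what gets tokenized and handed to NooJ; the Italian side never enters the paraphrasing pipeline.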
Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyzes the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective, or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped through its derivational paradigm, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
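Our grammar runs inside NooJ; purely for illustration, the core substitution can be approximated in a few lines of Python with a toy nominalization lexicon (all names here are our own, and only the simple present is handled, as with the base grammar):

```python
import re

# Toy nominalization lexicon; NooJ derives such pairs from the
# derivational paradigms in its dictionaries.
NOMINALIZATIONS = {'cancellation': 'cancel', 'installation': 'install',
                   'reservation': 'reserve'}
SUPPORT = r'\b(?:make|makes|take|takes|give|gives|have|has)\b'

def paraphrase_svc(sentence):
    """Replace 'support verb (+ determiner) + predicate noun (+ of)'
    with the corresponding full verb, e.g. 'make the cancellation of'
    -> 'cancel'. An illustration, not the NooJ implementation."""
    pattern = re.compile(
        SUPPORT + r'\s+(?:(?:the|a|an)\s+)?(\w+)(?:\s+of\b)?', re.I)
    def repl(m):
        # Unknown predicate nouns leave the match untouched.
        return NOMINALIZATIONS.get(m.group(1).lower(), m.group(0))
    return pattern.sub(repl, sentence)
```

For example, "Press Cancel to make the cancellation of your personal information" becomes "Press Cancel to cancel your personal information", while sentences without a known predicate noun pass through unchanged.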
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results
The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC. Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 (see footnote 1).
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.
While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.
Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.
The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is considerably higher compared to MT approaches.
In the future we will continue to explore paraphrasing of other support verbs and will add support for other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the COLING/ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN           probabilities                  alignment
být větší   be greater   0.538  0.053  0.538  0.136    0-0 1-1
být větší   be larger    0.170  0.054  0.019  0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and their corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures and are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
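As an illustration of how such scores can drive selection, the following sketch ranks alternative translations by the product of their four scores. This combination is our simplifying assumption (the paper does not specify the exact formula), and the phrase-table entries are taken from Figure 1:

```python
from math import prod

# phrase-table-like entries: subsegment -> [(translation, four Moses scores)]
# scores copied from Figure 1 of the paper
candidates = {
    "být větší": [
        ("be greater", [0.538, 0.053, 0.538, 0.136]),
        ("be larger",  [0.170, 0.054, 0.019, 0.148]),
    ]
}

def best_translation(subsegment):
    # rank alternative translations by the product of their four scores
    options = candidates.get(subsegment, [])
    if not options:
        return None
    return max(options, key=lambda item: prod(item[1]))[0]
```

With the Figure 1 scores, "be greater" wins, since its score product is roughly two orders of magnitude larger than that of "be larger".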
The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
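A small sketch (our own illustration, not the authors' code) that parses this notation and flags the three situations listed above:

```python
def parse_alignment(points, src_len, tgt_len):
    # "0-0 1-1" -> [(0, 0), (1, 1)]
    links = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_aligned = {s for s, _ in links}
    tgt_aligned = {t for _, t in links}
    return {
        "links": links,
        # 1) empty alignment: some source or target word has no counterpart
        "empty": len(src_aligned) < src_len or len(tgt_aligned) < tgt_len,
        # 2) one-to-many: a source word is linked to several target words
        "one_to_many": len(links) > len(src_aligned),
        # 3) opposite orientation: target order reverses source order somewhere
        "reordering": any(s1 < s2 and t1 > t2
                          for s1, t1 in links for s2, t2 in links),
    }
```

For the Figure 1 entry, parse_alignment("0-0 1-1", 2, 2) reports none of the three situations, i.e. a clean monotone one-to-one alignment.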
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOIN^O: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOIN^N: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE^O method, Czech → English

original:        • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     • "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTE^O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTE^N: the second subsegment is inserted into the gap in the first segment; the output is NS.
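The overlapping variants can be pictured as merging two token sequences on their longest boundary overlap. The sketch below is our simplification (not the authors' implementation) of a JOIN^O-style overlap merge, reusing the Czech example from Table 1:

```python
def join_overlap(left, right):
    # concatenate two token lists that overlap on a suffix/prefix boundary,
    # e.g. ["lze", "rozdělit", "do"] + ["do", "následujících", "kategorií"]
    best = 0
    for k in range(1, min(len(left), len(right)) + 1):
        if left[-k:] == right[:k]:
            best = k  # remember the longest matching boundary
    if best == 0:
        return None  # no overlap: the neighbouring (JOIN^N) variant applies instead
    return left + right[best:]
```

The same merge is applied on the target side of the subsegment pairs, which is how the new translation in Table 1 is assembled.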
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
² In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).
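The iteration described above can be sketched as follows. This is our reading of the algorithm (an assumption, not the authors' code), with subsegments represented as (start, end) token-index pairs in the segment and a join allowed when two spans overlap or are direct neighbours:

```python
def join_subsegments(spans):
    # spans: (start, end) token positions of known subsegments in the segment
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    results = []
    while I:
        current = I.pop(0)          # biggest remaining subsegment
        T = []                      # temporary list of newly joined spans
        for other in I:
            # join when the two spans overlap or are direct neighbours
            if current[0] <= other[1] + 1 and other[0] <= current[1] + 1:
                joined = (min(current[0], other[0]), max(current[1], other[1]))
                if joined not in seen:
                    seen.add(joined)
                    T.append(joined)
        # prepend new (longer) spans so they are tried first
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I
        results.append(current)
    return results
```

Starting from spans covering tokens 0-2, 3-5 and 7-8, the sketch joins the two adjacent spans into 0-5, but the gap at token 6 blocks a full 0-8 join, mirroring how the method only produces segments fully backed by known subsegments.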
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For a comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for the assessment of actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB     OJ     NJ    all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16  10.05  25.09  40.95  42.58
any     10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ     NJ    all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%  10.24  27.77  41.90  44.47
all     11.58  29.32  46.66  49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ     NJ    all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%  12.15  34.84  49.82  51.62
all     14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations into the target language. When we take into ac-
³ http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches

Company in-house translation memory
feature    SUB     OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
count just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the source segment, assuming the same order in the target language, see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³
¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog,² which is language pair independent and allows users to upload their own memories in
¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TMs, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations.⁴ Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. The TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we want to    have a table near the window .
|  D    S     S    |   |    S    |   |      |
we ---  would like a table by   the window .
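Full TER additionally allows block shifts; as a rough sketch of how such an alignment can be obtained, the following word-level Levenshtein alignment (substitutions, insertions and deletions only, no shifts) reproduces the edit operations for the example above:

```python
# Word-level Levenshtein alignment with an operation backtrace.
# This approximates TER without the shift operation.
def edit_ops(ref, hyp):
    """Return (distance, ops); ops are 'C' (match), 'S' (substitution),
    'D' (deletion) and 'I' (insertion) for turning ref into hyp."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    ops, i, j = [], m, n                     # backtrace from the corner
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append('C' if ref[i - 1] == hyp[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return d[m][n], ops[::-1]

tm_match = "we want to have a table near the window .".split()
test_input = "we would like a table by the window .".split()
dist, ops = edit_ops(tm_match, test_input)
print(dist)               # 4 edits: 1 deletion and 3 substitutions
print(' '.join(ops))
```

On this pair the sketch yields 4 edits (1 deletion and 3 substitutions), in line with the TER alignment of the example.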
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results; these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
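This preference can be reproduced with a small sketch; the weights below are illustrative only, not the actual values used in CATaLog:

```python
# Illustrative, non-uniform edit-operation weights (not the exact CATaLog
# values): deletions are cheap to post-edit, so they get a much lower weight.
WEIGHTS = {'C': 0, 'D': 1, 'S': 10, 'I': 10}

def weighted_cost(ops):
    """Score a TM match by summing the weights of its TER-style edit ops."""
    return sum(WEIGHTS[op] for op in ops)

# Edit ops needed to turn each TM match into the test segment
# "how much does it cost":
tm1_ops = ['C'] * 5 + ['D'] * 6   # delete "to the holiday inn by taxi"
tm2_ops = ['C'] * 2 + ['I'] * 3   # insert "does it cost"

print(weighted_cost(tm1_ops))  # 6  -> TM segment 1 wins under these weights
print(weighted_cost(tm2_ops))  # 30
```

With uniform weights TM segment 2 would score lower (3 vs 6); the cheap deletion weight inverts the ranking, as described above.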
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
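A minimal sketch of how the two alignment sets can be combined follows; the TER operations correspond to the example above, but the alignment dictionary and target tokens are toy values, not actual system output:

```python
# Sketch: project source-side TER matches onto the target translation.
# ter_ops[i] is the edit op for TM source word i ('C' = matched the input);
# align maps TM source positions to target positions (word-aligner output).
# All data below is toy/hypothetical.
def color_target(target_words, ter_ops, align):
    green = {t for s, op in enumerate(ter_ops) if op == 'C'
             for t in align.get(s, [])}
    return ['green' if i in green else 'red'
            for i in range(len(target_words))]

# TM source: "we want to have a table near the window"
ter_ops = ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
align = {0: [0], 5: [4], 7: [1], 8: [1]}        # source idx -> target idx
target = ['t0', 't1', 't2', 't3', 't4', 't5']   # placeholder target words
print(color_target(target, ter_ops, align))
# ['green', 'green', 'red', 'red', 'green', 'red']
```

Target words aligned to a source word that TER marked as a match are colored green; all others default to red.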
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me
5. you 're overcharging me
Target Matches
1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপনি ভুল (Gloss: apni vul.)

4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din.)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature allows comparing results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will differ significantly.
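The candidate filtering described above can be sketched as follows (the stop-word list and TM sentences here are illustrative):

```python
# Minimal posting-list (inverted index) sketch for candidate filtering.
# Stop words are removed and words lowercased, as described in the text;
# stemming is deliberately skipped so only surface forms match.
STOP = {"the", "a", "to", "is", "of"}   # illustrative stop-word list

def build_index(tm_sentences):
    """Map each vocabulary word to the set of TM sentence ids containing it."""
    index = {}
    for sid, sent in enumerate(tm_sentences):
        for word in set(sent.lower().split()):
            if word not in STOP:
                index.setdefault(word, set()).add(sid)
    return index

def candidates(index, query):
    """Union of posting lists for the query's words: the reduced search space."""
    ids = set()
    for word in set(query.lower().split()):
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
idx = build_index(tm)
print(sorted(candidates(idx, "we would like a table by the window")))  # [0]
```

Only the returned candidates are then passed to the (much more expensive) TER alignment step.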
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved to a log file, which post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
3 Theoretical background
There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word- and/or character-level edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions needed to make two sentences identical (Hirschberg, 1997).

For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based score is 76% (14 deletions for 60 characters), without counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.
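A common formulation of the fuzzy match score is 1 − d/max(|s|,|t|), where d is the Levenshtein distance; the exact figures quoted above depend on the formula variant and tokenization used. A minimal sketch of this formulation:

```python
def levenshtein(a, b):
    """Plain Levenshtein distance over two sequences (words or characters)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitution
                         prev[j] + 1,                           # deletion
                         cur[j - 1] + 1)                        # insertion
        prev = cur
    return prev[n]

def fuzzy_match(s, t, unit=str.split):
    """One common fuzzy-match variant: 1 - d / max(len); unit selects
    word-based (str.split) or character-based (list) comparison."""
    s, t = unit(s), unit(t)
    return 1 - levenshtein(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2), 2))             # word-based score
print(round(fuzzy_match(s1, s2, unit=list), 2))  # character-based score
```

Note that this max-length variant gives 0.6 for the word-based case (4 edits over 10 words), slightly lower than the 70% figure above, which is computed over a different word count.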
(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
In this case, according to the methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).
SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).
SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear either without a determiner (e.g., to make attention) or with one (e.g., to make a reservation) (Athayde, 2001).
Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which usually are linguistically correct.
In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition to this, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Apart from fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.
4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.
4.1 NooJ
The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The
module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyzes the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
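The NooJ grammar itself is a graphical transducer; purely as an illustration of the paraphrasing step, a toy regex-based sketch with a small, invented nominalization lexicon (which, unlike the real grammar, ignores tense and person agreement) could look like this:

```python
import re

# Toy sketch of SVC paraphrasing, NOT the actual NooJ grammar:
# a small, hypothetical nominalization lexicon maps predicate nouns
# to their full-verb paraphrases.
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|have|has|give|gives)"
DET = r"(?:a|an|the)"

# support verb + optional determiner + predicate noun + optional "of"
PATTERN = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:{DET}\s+)?(\w+)(?:\s+of)?\b", re.IGNORECASE)

def paraphrase(sentence):
    """Replace a recognized SVC with its full-verb paraphrase."""
    def repl(m):
        noun = m.group(1).lower()
        # Leave the text untouched if the noun is not a known nominalization.
        return NOMINALIZATIONS.get(noun, m.group(0))
    return PATTERN.sub(repl, sentence)

print(paraphrase("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

Ordinary verb-noun combinations such as "have a table" pass through unchanged, since "table" is not in the nominalization lexicon.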
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrased). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, with a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.

The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio/
Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where the translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.
Seg: Make sure that the brake hose is not twisted
TMsl: Ensure that the brake hose is not twisted
TMtg: Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg: CAUTION You must make the istallation of the version 6 of the software
TMsl: CAUTION You must install the version 6 of the software
TMtg: ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve the translation fuzzy match level as much as possible, given that the meaning is the same although the sentences do not share the same lexical items.
The hybridization strategy implemented here has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are high compared to MT approaches.
In the future, we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, applying a paraphrasing framework to the target sentences may improve the quality of the translations even more.
References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

- JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOIN-O: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOIN-N: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTE-O method, Czech -> English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
- SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
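The greedy joining procedure described above can be sketched with index spans alone. This is a simplified reading of the algorithm, not the authors' implementation: the real system also carries the subsegments' translations and scores candidate joins with the target language model.

```python
# Sketch of the greedy JOIN combination, working only with index
# spans (start, end) of subsegments inside the tokenized segment.
# Simplification: translations and LM fluency scoring are omitted.

def length(span):
    return span[1] - span[0] + 1

def can_join(a, b):
    # JOIN-O: the spans overlap; JOIN-N: they are direct neighbours.
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1

def join(a, b):
    (s1, e1), (s2, e2) = sorted([a, b])
    return (s1, max(e1, e2))

def greedy_join(spans):
    # Process spans ordered by size (descending), preferring to join
    # the longest subsegments first, as the paper describes.
    pending = sorted(spans, key=length, reverse=True)
    while pending:
        head, rest = pending[0], pending[1:]
        joined = {join(head, s) for s in rest
                  if can_join(head, s) and length(join(head, s)) > length(head)}
        if not joined:
            return head  # no further join enlarges the covered span
        # Newly created (longer) spans are tried first next iteration.
        pending = sorted(joined, key=length, reverse=True) + rest
    return None

# Three overlapping/neighbouring spans merge into full coverage.
print(greedy_join([(0, 2), (2, 4), (1, 3)]))  # prints: (0, 4)
```

Spans that neither overlap nor neighbour each other are never joined, which mirrors the JOIN-O/JOIN-N distinction.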
2 In current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory:
feature  SUB   OJ    NJ    NS
prec     0.60  0.63  0.70  0.66
recall   0.67  0.74  0.74  0.71
F1       0.64  0.68  0.72  0.68
METEOR   0.31  0.37  0.38  0.38

DGT:
prec     0.76  0.93  0.91  0.81
recall   0.78  0.86  0.88  0.85
F1       0.77  0.89  0.89  0.83
METEOR   0.40  0.50  0.51  0.45

Table 6: Error example, Czech -> English

source seg.:    Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg.: Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.
6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated steps for filtering out phrases which provide non-fluent output texts in the target language.

We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri23, Mihaela Vela2, Josef van Genabith23
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
31 Finding Similarity
For finding out the similar and dissimilar partsbetween the test segment and a matching TMsegment we use TER alignments TER is anerror metric and it gives an edit ratio (often re-ferred to as edit rate or error rate) in terms ofhow much editing is required to convert a sen-tence into another with respect to the lengthof the first sentence Allowable edit operationsinclude insert delete substitute and shift Weuse the TER metric (using tercom-72515) tofind the edit rate between a test sentence andthe TM reference sentences
Simard and Fujita (2012) first proposed theuse of MT evaluation metrics as similarityfunctions in implementing TM functionalityThey experimented with several MT evalua-tion metrics viz BLEU NIST Meteor andTER and studied their behaviors on TM per-formance In the TM tool presented here weuse TER as the similarity metric as it is veryfast and lightweight and it directly mimicsthe human post-editing effort Moreover thetercom-7251 package also produces the align-ments between the sentence pair from which itis very easy to identify which portions in thematching segment match with the input sen-tence and which portions need to be workedon Given below are an input sentence a TMmatch and the TER alignment between themwhere C represents a match (shown as the ver-tical bar lsquo|rsquo) and I D and S represents thethree post-editing actions - insertion deletionand substitution respectively
5 http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window

TM Match: we want to have a table near the window
TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), ("","",C,0)
we want to  have a table near the window
|  D   S    S    |   |    S    |   |
we  -  would like a table by   the window
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
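The weighted variant can be sketched as a small change to the edit-distance recurrence. The cost values below (deletion 0.1, everything else 1.0) are purely illustrative assumptions, not the weights actually used in CATaLog:

```python
def weighted_edit_cost(tm_seg, test_seg, ins=1.0, dele=0.1, sub=1.0):
    """Edit cost of converting a TM segment into the test segment,
    with per-operation weights; a cheap deletion makes longer TM
    matches preferable to shorter ones. Illustrative weights only.
    """
    m, n = len(tm_seg), len(test_seg)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dele      # drop leftover TM words
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins       # insert missing test words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if tm_seg[i - 1] == test_seg[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + match)
    return d[m][n]
```

With these weights, TM segment 1 from the example costs 6 deletions at 0.1 each (0.6), while TM segment 2 costs 3 insertions at 1.0 each (3.0), so segment 1 is ranked first; with uniform weights the ranking flips.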
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
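Combining the two alignment sets can be sketched as follows. This is a simplified illustration under assumed data structures (a boolean per source token for the TER result, and (source, target) index pairs for the word alignment), not CATaLog's actual internals:

```python
def color_segments(ter_match, src_tgt_align, tgt_len):
    """Project per-token TER match flags onto the target side.

    ter_match: list of booleans, True where source token i of the TM
               match agrees with the input sentence (TER op 'C').
    src_tgt_align: list of (src_idx, tgt_idx) word-alignment links.
    Returns ('green'/'red' per source token, 'green'/'red' per target token).
    """
    src_colors = ['green' if ok else 'red' for ok in ter_match]
    tgt_colors = ['red'] * tgt_len          # unaligned words stay red
    for s, t in src_tgt_align:
        if ter_match[s]:
            tgt_colors[t] = 'green'
    return src_colors, tgt_colors
```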
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation with more green fragments than red fragments is a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color yet not be good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars
2. i think you've got the wrong number
3. you are wrong
4. you pay me
5. you're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3. আপিন ভল (Gloss: apni vul)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve the search efficiency we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, since we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch the use of these posting lists on or off; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
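A minimal posting-list index along the lines described above might look like this. The stop-word list and whitespace tokenization are simplified assumptions for illustration:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}   # illustrative only

def build_index(tm_sources):
    """Map each lowercased, non-stop surface form to the ids of the
    TM source sentences containing it (the posting lists)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """Only TM sentences sharing at least one vocabulary word with the
    input are passed on to the (expensive) TER scoring step."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```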
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36
module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.
artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.
Figure 2: Pipeline of the paraphrase framework
The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.
Figure 3: Local grammar for identification and paraphrasing of SVCs
In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
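As a rough illustration of what such a grammar does (this is not the NooJ implementation; the tiny nominalization lexicon, the support-verb list, and the regex pattern below are invented for the example, and tense/person agreement is ignored):

```python
import re

# Hypothetical mini-lexicon: predicate noun -> full verb.
NOMINALIZATIONS = {"decision": "decide", "installation": "install"}
SUPPORT_VERBS = {"make", "makes", "take", "takes", "have", "has"}

def paraphrase_svc(sentence):
    """Replace 'support verb + determiner + predicate noun (+ of)'
    with the corresponding full verb. A toy sketch of the SVC
    paraphrasing idea; no tense or person agreement is handled."""
    pattern = re.compile(
        r"\b(%s)\s+(?:an?|the)\s+(%s)\b(?:\s+of)?"
        % ("|".join(SUPPORT_VERBS), "|".join(NOMINALIZATIONS)))
    return pattern.sub(lambda m: NOMINALIZATIONS[m.group(2)], sentence)
```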
The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units with the original TM, if any. The paraphrased translation units have the same properties, tags, etc. as the original units.
This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually so as to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 (see footnote 1).
The results of both analyses are given in Table 1.
Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category | plTM | parTM
100%                 |  14  |  48
95-99%               |  23  |  32
85-94%               |  18  |  29
75-84%               |  51  |  38
50-74%               |  32  |  18
No Match             |  62  |  35
Total                | 200  | 200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought.
Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.
Despite the improvements in fuzzy matching, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg.: Make sure that the brake hose is not twisted
TMsl: Ensure that the brake hose is not twisted
TMtg: Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg.: CAUTION You must make the istallation of the version 6 of the software
TMsl: CAUTION You must install the version 6 of the software
TMtg: ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion
In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences have the same meaning but do not share the same lexical items.
The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is considerably higher compared to MT approaches.
In the future we will continue to explore the paraphrasing of other support verbs and the support of other languages as well. Last but not least, a paraphrase framework applied to the target sentences may improve the quality of translations even further.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1, Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and Van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: Modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ        | EN         | probabilities             | alignment
být větší | be greater | 0.538 0.053 0.538 0.136   | 0-0 1-1
být větší | be larger  | 0.170 0.054 0.019 0.148   | 0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
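The Figure 1 output and the three alignment properties above can be illustrated with a short parser. This is a hedged sketch: the `|||`-separated field layout mirrors the Moses phrase-table format, which is an assumption about the tool's exact output.

```python
# Sketch, assuming Moses-style "src ||| tgt ||| probs ||| alignment" lines.

def parse_entry(line):
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    probs = [float(p) for p in probs.split()]
    points = [tuple(map(int, p.split("-"))) for p in align.split()]
    return src, tgt, probs, points

def alignment_properties(src, tgt, points):
    """Derive the three properties mentioned above from the alignment points."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    aligned_src = {i for i, _ in points}
    aligned_tgt = {j for _, j in points}
    empty = (len(aligned_src) < src_len) or (len(aligned_tgt) < tgt_len)
    one_to_many = len(points) > len(aligned_src) or len(points) > len(aligned_tgt)
    # opposite orientation: some pair of alignment points crosses
    reordered = any((i1 < i2 and j1 > j2)
                    for (i1, j1) in points for (i2, j2) in points)
    return {"empty": empty, "one_to_many": one_to_many, "reordered": reordered}

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, probs, points = parse_entry(entry)
print(alignment_properties(src, tgt, points))
# all three properties are False for this fully monotone 1:1 alignment
```

A crossing pair such as `0-1 1-0` would set `reordered`, and a point set like `0-0 0-1` would set `one_to_many`, which is exactly the information the combination methods need.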
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1..5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
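The combined n-gram fluency score can be sketched as follows. This is illustrative only: the paper scores fluency with a KenLM model trained on enTenTen, whereas the toy scorer below uses raw n-gram counts, and the particular combination (averaged per-n log counts with add-one smoothing) is an assumption.

```python
# Toy stand-in for the KenLM-based fluency score: a combined n-gram score
# for n = 1..5 over a tiny in-memory corpus.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NgramScorer:
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in corpus:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def score(self, sentence):
        toks = sentence.split()
        total = 0.0
        for n in range(1, self.max_n + 1):
            grams = ngrams(toks, n)
            if not grams:
                continue
            # add-one inside the log so unseen n-grams do not zero out the score
            hit = sum(math.log(1 + self.counts[n][g]) for g in grams)
            total += hit / len(grams)
        return total

corpus = ["can be divided into the following categories",
          "the following categories are relevant"]
lm = NgramScorer(corpus)
fluent = lm.score("divided into the following categories")
shuffled = lm.score("categories following the into divided")
assert fluent > shuffled  # the fluent word order gets the higher combined score
```

Candidate combinations would then be ordered by this score, with the higher-order n-grams rewarding locally fluent joins.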
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
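A simplified sketch of this greedy strategy, restricted to the JOINN (neighbouring) variant, is shown below; the exact bookkeeping with the lists I and T in the paper's implementation may differ, and the naive concatenation of translations assumes the same phrase order in the target language.

```python
# Simplified sketch of the JOIN method (neighbouring variant only).
# Assumption: target phrases can be concatenated in source order, which is
# exactly the simplification the paper's error analysis later points out.

def join_subsegments(segment_tokens, subsegments):
    """subsegments: dict mapping a source phrase to its translation.
    Returns new, longer phrase pairs built from subsegments that are
    adjacent inside the tokenized input segment."""
    spans = []  # (start, end, phrase, translation) positions in the segment
    for phrase, translation in subsegments.items():
        p = phrase.split()
        for i in range(len(segment_tokens) - len(p) + 1):
            if segment_tokens[i:i + len(p)] == p:
                spans.append((i, i + len(p), phrase, translation))
    # longest subsegments first, as the algorithm prefers joining longer ones
    spans.sort(key=lambda s: s[1] - s[0], reverse=True)
    joined = []
    for a in spans:
        for b in spans:
            if a[1] == b[0]:  # b starts exactly where a ends: neighbours
                joined.append((a[2] + " " + b[2], a[3] + " " + b[3]))
    return joined

segment = "lze rozdělit do následujících kategorií".split()
subs = {"lze rozdělit do": "can be divided into",
        "následujících kategorií": "the following categories"}
print(join_subsegments(segment, subs))
# -> [('lze rozdělit do následujících kategorií',
#      'can be divided into the following categories')]
```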
2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16   10.05   25.09   40.95   42.58
any     10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%  10.24   27.77   41.90   44.47
all     11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%  12.15   34.84   49.82   51.62
all     14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations into the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg.:    Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg.: Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
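Turning such a tercom-style edit sequence into the matched/unmatched labelling used later for color coding can be sketched as follows; the op string below is transcribed from the example above (final punctuation token omitted), and the function itself is an illustrative simplification, not the tool's implementation.

```python
# Sketch: label each TM-match token as matched ('C') or needing work ('D'/'S').
# Insertions ('I') add input-side tokens and consume no TM token.

def label_tm_tokens(tm_tokens, ops):
    labels = []
    it = iter(tm_tokens)
    for op in ops:
        if op == "I":
            continue  # insertion: nothing to color on the TM side
        labels.append((next(it), "green" if op == "C" else "red"))
    return labels

tm_match = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
print(label_tm_tokens(tm_match, ops))
# 'want', 'to', 'have' and 'near' come out red; the rest green
```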
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
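The effect of the weighted edit costs can be sketched with a standard token-level Levenshtein computation whose deletion cost is configurable. This is a simplification: shift operations and the exact weights used in CATaLog are omitted/assumed.

```python
# Sketch of the weighted edit-cost idea: cost of editing a TM segment into the
# test segment, with a cheap deletion (shifts omitted for brevity).

def weighted_edit_cost(tm_tokens, test_tokens, w_del=0.1, w_ins=1.0, w_sub=1.0):
    m, n = len(tm_tokens), len(test_tokens)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_tokens[i - 1] == test_tokens[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # delete a TM token
                          d[i][j - 1] + w_ins,    # insert a test token
                          d[i - 1][j - 1] + sub)  # match / substitute
    return d[m][n]

test = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()
# equal weights (TER-like) prefer segment 2; a cheap deletion prefers segment 1
assert weighted_edit_cost(seg1, test, w_del=1.0) > weighted_edit_cost(seg2, test, w_del=1.0)
assert weighted_edit_cost(seg1, test) < weighted_edit_cost(seg2, test)
```

With equal weights the costs are 6 deletions vs. 3 insertions, so segment 2 wins; with the deletion weight lowered to 0.1, segment 1 costs 0.6 against 3.0 and wins, exactly as argued above.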
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions imply matched fragments and red portions imply a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
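The target-side coloring step can be sketched as follows: source-side match/mismatch labels are projected onto the translation through the word alignment. The alignment format follows the example above (per source word, a list of 1-based target positions), and the placeholder tokens t1..t7 stand in for the seven Bengali target tokens; this is an illustrative simplification of the tool's actual procedure.

```python
# Sketch: project source-side match labels onto target tokens via the alignment.
# If a target position is touched by several source words, the last one wins
# (a simplification); unaligned target words default to red.

def color_target(target_tokens, src_alignments, src_is_match):
    colors = ["red"] * len(target_tokens)
    for positions, matched in zip(src_alignments, src_is_match):
        for p in positions:  # 1-based target positions, as in the example
            colors[p - 1] = "green" if matched else "red"
    return list(zip(target_tokens, colors))

target = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
# we(1) want(6) to() have() a(4) table(5) near(3) the() window(2) .(7)
alignment = [[1], [6], [], [], [4], [5], [3], [], [2], [7]]
# TER labels from the earlier example: 'want', 'to', 'have', 'near' mismatch
matched = [True, False, False, False, True, True, False, True, True, True]
print(color_target(target, alignment, matched))
# t3 and t6 (aligned to 'near' and 'want') come out red, the rest green
```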
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a near 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form as in some input sentence do we consider them a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch the use of these posting lists on or off, so that results with and without posting lists can be compared. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
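A minimal version of such a posting-list index might look as follows. The stop-word list, thresholds and all names here are illustrative assumptions — the paper does not publish its implementation:

```python
# Sketch of the inverted index described in Section 3.3: each vocabulary word
# maps to the set of TM segment ids containing it, and only segments sharing
# at least one vocabulary word with the input are scored further.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "are", "i", "you", "to", "of"}  # toy list

def build_index(tm_sources):
    """Map each lowercased, unstemmed word to ids of TM segments containing it."""
    index = defaultdict(set)
    for seg_id, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                index[word].add(seg_id)
    return index

def candidates(index, input_sentence):
    """Union of posting lists for the input's words = the reduced search space."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```

Only the segments returned by `candidates` would then undergo the expensive TER-based similarity computation, which is where the speed-up comes from.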
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. US Patent 6,345,244, February 5.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
translation units in the original TM, if any. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
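The core idea — adding a paraphrased source side while keeping the target side unchanged — can be sketched as follows. NooJ grammars perform the real SVC recognition and paraphrasing; this dictionary-based stand-in and all names in it are illustrative assumptions only:

```python
# Toy sketch of TM expansion by support-verb-construction (SVC) paraphrasing.
# A real system recognizes SVCs with grammars; here a small hand-made mapping
# from SVC to simple verb stands in for that step.

SVC_PARAPHRASES = {
    "make the installation of": "install",   # example mapping, ours
    "give a description of": "describe",
}

def expand_tm(tm):
    """tm: list of (source, target) pairs.
    Returns tm plus paraphrased source variants with targets kept unchanged."""
    expanded = list(tm)
    for source, target in tm:
        for svc, verb in SVC_PARAPHRASES.items():
            if svc in source:
                expanded.append((source.replace(svc, verb), target))
    return expanded
```

Because only the source side is rewritten, the retrieved target translation is still a human translation approved in the original TM, which is exactly why translators can reuse it "as is".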
5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation method for this purpose, we formulate it as a ranking problem.
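Fuzzy match scores in TM tools are typically edit-distance based. Since the exact formula used by Trados Studio is proprietary and not given in the paper, the following word-level Levenshtein-based sketch (all names are ours) only illustrates why a paraphrased source unit can raise the match score:

```python
# Hedged sketch of a word-based fuzzy match score: 100 * (1 - edits / length).
# Commercial CAT tools use their own weighted variants of this idea.

def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_score(source, candidate):
    """Percentage similarity between an input sentence and a TM source."""
    s, c = source.lower().split(), candidate.lower().split()
    return round(100 * (1 - edit_distance(s, c) / max(len(s), len(c))))
```

For the input "Make sure that the brake hose is not twisted", the plain TM unit "Ensure that the brake hose is not twisted" scores below 100%, while a paraphrased source unit identical to the input scores 100% — the effect measured in Table 1.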
In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹
The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95–99%, 6% in category 85–94%, 28% in category 75–84%, and finally 27% in category 0–74% (No match + 50–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio
Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
To check the quality of the retrieved segments, a human evaluation was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.
According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.
While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or complex syntax. This is one of our approach's weaknesses and will be addressed in future projects.
Example 1
  Seg:     Make sure that the brake hose is not twisted
  TMsl:    Ensure that the brake hose is not twisted
  TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl: Make sure that the brake hose is not twisted

Example 2
  Seg:     CAUTION You must make the istallation of the version 6 of the software
  TMsl:    CAUTION You must install the version 6 of the software
  TMtg:    ATTENZIONE Si deve installare la versione 6 del software
  parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even further.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop on Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available, DGT (Steinberger et al 2013) or MyMemory (Trombetti 2009) to mention just a few, but there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs thus translates directly into savings for translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities               alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse 2000).

In (Désilets et al 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al 2003). However, our methods do not present the subsegments themselves as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel 2008) (a parallel version of GIZA++ (Och and Ney 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures and are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word of the source, být, is translated to the first word of the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
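The subsegment entries of Figure 1 can be parsed along these lines. The field layout is assumed from the description above, and the choice of ranking probability and all names are our own; the paper does not specify which of the four scores it uses for selection:

```python
# Sketch: parse one generated subsegment entry (source, target, four Moses
# probabilities, Moses-style alignment points) and pick a best translation.

def parse_entry(source, target, probs, alignment):
    scores = [float(p) for p in probs.split()]
    points = [tuple(map(int, pt.split("-"))) for pt in alignment.split()]
    return {"source": source, "target": target,
            "inv_phrase": scores[0], "inv_lex": scores[1],
            "dir_phrase": scores[2], "dir_lex": scores[3],
            "alignment": points}

def best_translation(entries):
    """Rank alternatives by direct phrase translation probability (our choice)."""
    return max(entries, key=lambda e: e["dir_phrase"])
```

The parsed alignment points make the three cases above easy to detect: a source index missing from the pairs signals an empty alignment, a repeated index signals one-to-many, and decreasing target indices signal opposite orientation.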
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two subsegments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments are created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
² In the current experiments we have trained a language model using the KenLM tool (Heafield 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al 2013).
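The two JOIN variants can be illustrated with index spans over the tokenized input segment. This is a simplified sketch with our own names: the ordering by length, the temporary-list bookkeeping and the language-model fluency filter described above are deliberately omitted:

```python
# Sketch of the JOIN conditions: a subsegment is an inclusive (start, end)
# token span in the input segment; two spans can be joined when they overlap
# (JOINO) or are immediate neighbours (JOINN).

def can_join(span_a, span_b, mode):
    (s1, e1), (s2, e2) = sorted([span_a, span_b])
    if mode == "overlap":      # JOINO: spans share at least one token
        return s2 <= e1
    if mode == "neighbour":    # JOINN: second span starts right after the first
        return s2 == e1 + 1
    raise ValueError(mode)

def join(span_a, span_b):
    """Merge two joinable spans into one covering both."""
    (s1, e1), (s2, e2) = sorted([span_a, span_b])
    return (s1, max(e1, e2))
```

On the target side, the corresponding translations are concatenated (deduplicating the overlapped words for JOINO), after which the candidate is scored for fluency by the language model.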
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software; the same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al 2013); we used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken over from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie 2014). We achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments including those with empty translations into the target language.

³ http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches

Company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error example, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
When we take into account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric was proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as the MemoQ Fragment assembly technique does: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases that would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³

¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog², which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

¹www.matecat.com
²The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

³The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .
TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)
we want to   have a table near the window
|  D    S    S    |   |    S    |   |
we -    would like a table by   the window
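The alignment above can be reproduced in spirit with a standard Levenshtein-style dynamic program. The sketch below is not the tercom implementation (it omits TER's block-shift operation) and the function name is our own:

```python
# Sketch: label each position C (match), S (substitution), I (insertion)
# or D (deletion), similar in spirit to the TER alignment shown above.

def align(hyp, ref):
    """Return a list of (hyp_word, ref_word, op) triples for a minimal
    edit-distance alignment of the two token lists."""
    n, m = len(hyp), len(ref)
    # cost[i][j] = edits needed to turn hyp[:i] into ref[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append((hyp[i - 1], ref[j - 1], "C" if hyp[i - 1] == ref[j - 1] else "S"))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append((hyp[i - 1], "", "D"))   # word only in the TM match
            i -= 1
        else:
            ops.append(("", ref[j - 1], "I"))   # word only in the test segment
            j -= 1
    return ops[::-1]

tm_match = "we want to have a table near the window".split()
test_seg = "we would like a table by the window".split()
ops = align(tm_match, test_seg)
for h, r, op in ops:
    print(op, h or "-", r or "-")
```

For this pair the minimal alignment has 5 matches and 4 edits (1 D, 3 S), matching the worked example in the text.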
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. In post-editing, however, deletion takes much less time than the other edit operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
3.2 Color Coding

From the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color but are nevertheless not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words in the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
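As a rough sketch of how per-word edit labels could be turned into the green/red display (illustrative only; CATaLog's actual rendering code is not shown in the paper):

```python
# Sketch: wrap matched words (label C) in green HTML spans and words that
# need editing (D/I/S) in red spans, mirroring the display described above.

def color_code(words, labels):
    """words/labels: parallel lists; label 'C' = match, anything else = edit."""
    spans = []
    for w, lab in zip(words, labels):
        color = "green" if lab == "C" else "red"
        spans.append('<span style="color:%s">%s</span>' % (color, w))
    return " ".join(spans)

tm_match = "we want to have a table near the window".split()
labels   = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]  # from the TER alignment
html = color_code(tm_match, labels)
print(html)
```

The same routine can be applied to the target side once the GIZA++ word alignment has been used to project the labels onto the translation.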
Input you gave me wrong number
Source Matches
1. you gave me the wrong change i paid eighty dollars
2. i think you've got the wrong number
3 you are wrong
4 you pay me
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results obtained with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file, which the post-editors can work with directly later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to opti-mize system performance
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
[Figure 4: Accepted translations. The figure shows TM entries with their source-language segment (TMsl), target-language segment (TMtg) and paraphrased source segment (parTMsl), e.g. Seg: "CAUTION You must make the istallation of the version 6 of the software"; parTMsl: "CAUTION You must install the version 6 of the software"; TMtg: "ATTENZIONE Si deve installare la versione 6 del software". Another parTMsl example: "Make sure that the brake hose is not twisted".]
6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve the translation fuzzy match level as much as possible in cases where the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for low fuzzy-match categories. In addition, translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and other supported languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even further.
References
Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.
Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ          EN          probabilities                  alignment
být větší   be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148       0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translation in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
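The selection step can be sketched as follows. The field layout mirrors Figure 1, but combining the four Moses scores into a single product is our illustrative assumption, not the authors' stated formula:

```python
# Sketch: choosing the best translation for a subsegment from
# Moses-style scored pairs. The four scores per pair follow the order
# described above (inverse phrase probability, inverse lexical
# weighting, direct phrase probability, direct lexical weighting).
# Multiplying them into one score is an illustrative assumption.

def best_translation(candidates):
    """candidates: list of (translation, [p1, p2, p3, p4]) tuples."""
    def score(entry):
        product = 1.0
        for p in entry[1]:
            product *= p
        return product
    return max(candidates, key=score)[0]

pairs = [
    ("be greater", [0.538, 0.053, 0.538, 0.136]),  # scores from Figure 1
    ("be larger",  [0.170, 0.054, 0.019, 0.148]),
]
print(best_translation(pairs))  # -> be greater
```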
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
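A minimal sketch of how such alignment points can be interpreted (the function and variable names are ours, not part of the Moses output):

```python
# Sketch: interpreting Moses alignment points such as "0-0 1-1".
# Each "i-j" pair links source token i to target token j. From the
# links we can also detect the three situations mentioned above:
# unaligned words (empty alignment), one-to-many alignments, and
# crossing links (opposite orientation).

def parse_alignment(points, src_tokens, tgt_tokens):
    links = [tuple(map(int, p.split("-"))) for p in points.split()]
    pairs = [(src_tokens[i], tgt_tokens[j]) for i, j in links]
    # empty alignment: source words with no link at all
    unaligned = [w for k, w in enumerate(src_tokens)
                 if not any(i == k for i, _ in links)]
    # opposite orientation: two links that cross each other
    crossing = any(i1 < i2 and j1 > j2
                   for i1, j1 in links for i2, j2 in links)
    return pairs, unaligned, crossing

pairs, unaligned, crossing = parse_alignment(
    "0-0 1-1", ["byt", "vetsi"], ["be", "greater"])
print(pairs)  # -> [('byt', 'be'), ('vetsi', 'greater')]
```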
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and it discards one processed subsegment.
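The greedy longest-first joining can be sketched like this, with subsegments represented as (start, end) token spans; the data structures and the stopping condition are our simplification of the algorithm described above:

```python
# Sketch: greedy longest-first joining of subsegment spans. Spans are
# half-open (start, end) token intervals in the segment; two spans may
# be joined when they neighbour or overlap. This is a simplification
# of the list-based bookkeeping described in the paper.

def join_spans(spans):
    """Return the longest span obtainable by repeatedly joining
    neighbouring or overlapping spans, longest-first."""
    work = sorted(set(spans), key=lambda s: s[0] - s[1])  # longest first
    grown = True
    while grown:
        grown = False
        head = work[0]                     # current longest subsegment
        for start, end in work[1:]:
            if start <= head[1] and head[0] <= end:  # neighbour/overlap
                merged = (min(head[0], start), max(head[1], end))
                if merged[1] - merged[0] > head[1] - head[0]:
                    work.insert(0, merged)  # new longer subsegment first
                    grown = True
                    break
    return work[0]

# three adjacent spans covering tokens 0..8 join into one
print(join_spans([(0, 3), (3, 5), (5, 8)]))  # -> (0, 8)
```

Each successful join strictly lengthens the leading span, so the loop terminates once no neighbouring or overlapping span can extend it further.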
2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-
3 http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert a sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we want to have a table near the window .
|  D    S  S    | |     S    |   |      |
we -    would like a table by the window .
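The C/S/I/D alignment above can be approximated with standard Levenshtein dynamic programming. Unlike tercom, this sketch does not handle shift operations, so it is only an approximation of full TER:

```python
# Sketch: recovering a C/S/I/D edit alignment between two token
# sequences via Levenshtein dynamic programming (no shift operation).

def edit_alignment(hyp, ref):
    m, n = len(hyp), len(ref)
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitute
                           dp[i - 1][j] + 1,         # delete from hyp
                           dp[i][j - 1] + 1)         # insert into hyp
    # trace back one optimal path, preferring the diagonal
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])):
            ops.append("C" if hyp[i - 1] == ref[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return ops[::-1]

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
print(edit_alignment(tm, inp))
# -> ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
```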
Since we want to rank the reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
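The effect of the weighted scoring can be sketched as follows. The concrete weight values below are our own illustrative assumptions; the paper only states that deletions receive a very low weight:

```python
# Edit-operation weights: low cost for deletion, as described in the text.
# The exact values used by CATaLog are not published, so these are assumptions.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(edit_ops):
    """Rank score for a TM match from its TER edit operations (lower is better)."""
    return sum(WEIGHTS[op] for op in edit_ops)

# Test segment: "how much does it cost"
tm1_ops = ["C"] * 5 + ["D"] * 6  # TM segment 1: delete "to the holiday inn by taxi"
tm2_ops = ["C"] * 2 + ["I"] * 3  # TM segment 2: insert "does it cost"

# Uniform weights (plain TER) prefer TM segment 2 (3 edits vs. 6 edits),
# while the weighted cost prefers TM segment 1.
print(weighted_cost(tm1_ops), weighted_cost(tm2_ops))
```

With the low deletion weight, TM segment 1 scores far below TM segment 2, matching the ranking argued for above.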
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easier, we color code the matched and unmatched parts of each reference translation: green portions indicate matched fragments and red portions indicate mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments green and the unmatched fragments red in the selected source sentences and their corresponding translations.
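The composition of the two alignment sets can be sketched as follows. This is a minimal illustration, not the tool's actual code; the 0-based index conversion of the GIZA++ example above is an assumption of the sketch:

```python
def color_target(ter_ops, src2tgt, tgt_len):
    """Color each target word green if every TM source word aligned to it
    matched the input (TER op C), red otherwise; target words with no
    source alignment stay red.
    ter_ops: one TER op per TM source word.
    src2tgt: dict, source index -> list of target indices (from GIZA++)."""
    matched, unmatched = set(), set()
    for i, op in enumerate(ter_ops):
        for j in src2tgt.get(i, []):
            (matched if op == "C" else unmatched).add(j)
    return ["green" if j in matched - unmatched else "red"
            for j in range(tgt_len)]

# TM source "we want to have a table near the window ." with its TER ops
# against the input "we would like a table by the window .", and the GIZA++
# alignment from the example above, converted to 0-based indices.
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
print(color_target(ter_ops, src2tgt, 7))
```

For the Bengali translation this marks the words aligned to "we", "a", "table", "window" and the final punctuation green, and those aligned to the deleted "want" and the substituted "near" red.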
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
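The posting-list filtering described above can be sketched as follows; the stop word list is an illustrative assumption:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}  # illustrative list

def build_postings(tm_sources):
    """Map each lowercased, unstemmed content word to the ids of the TM
    source segments containing it (the inverted index)."""
    postings = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidates(postings, input_sentence):
    """TM segments sharing at least one vocabulary word with the input;
    only these are passed on to the TER-based similarity measurement."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
p = build_postings(tm)
print(candidates(p, "we would like a table by the window"))
```

Only the first TM segment survives the filter for this input, so the expensive TER alignment is computed for one candidate instead of three.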
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Timonera, Katerina Raisa, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching and suggesting segments
CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with an assumption similar to our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The resulting extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
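The subsegment entries in Figure 1 follow Moses's phrase-table layout; the following is a small parsing sketch, assuming the standard `|||`-delimited format (the exact SUB file format may differ):

```python
def parse_phrase_entry(line):
    """Parse one Moses-style phrase-table line:
    source ||| target ||| four probabilities ||| alignment points."""
    src, tgt, probs, align = [field.strip() for field in line.split("|||")]
    probs = tuple(float(p) for p in probs.split())
    # alignment points like "0-0 1-1" become (source_idx, target_idx) pairs
    points = [tuple(int(i) for i in pt.split("-")) for pt in align.split()]
    return src.split(), tgt.split(), probs, points

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, probs, points = parse_phrase_entry(entry)
print(src, tgt, probs, points)
```

The four parsed probabilities correspond, in order, to the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting described above.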
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
32
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
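The SUBSTITUTEN variant can be sketched as a simple token-span replacement; this is an illustration (here reusing the Czech words of Table 1 as data), not the paper's implementation, which also carries the target-side translations along:

```python
def substitute_n(segment_tokens, gap, subseg_tokens):
    """SUBSTITUTEN sketch: fill the uncovered gap (start, end) in a
    partially matched segment with another subsegment's tokens."""
    start, end = gap
    return segment_tokens[:start] + subseg_tokens + segment_tokens[end:]

seg = "lze rozdělit do těchto kategorií".split()
# the gap covers "těchto" (token 3); another subsegment fills it
new = substitute_n(seg, (3, 4), ["následujících"])
print(" ".join(new))
```

This reproduces the new subsegment "lze rozdělit do následujících kategorií" from Table 1.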
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
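The combined n-gram fluency score can be sketched as follows. A toy count-based scorer stands in for the KenLM model actually used; the smoothing and score combination below are our own illustrative assumptions:

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class NgramScorer:
    """Toy count-based stand-in for the target language model:
    scores a candidate by its n-grams of orders 1..5."""
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = Counter()
        for sent in corpus:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts.update(ngrams(toks, n))

    def fluency(self, sentence):
        """Combined score: mean log(1 + count) over all n-grams, n = 1..5."""
        toks = sentence.split()
        grams = [g for n in range(1, self.max_n + 1) for g in ngrams(toks, n)]
        return sum(log(1 + self.counts[g]) for g in grams) / max(len(grams), 1)

scorer = NgramScorer(["can be divided into the following categories",
                      "the following categories are listed"])
print(scorer.fluency("divided into the following categories"),
      scorer.fluency("categories following the into divided"))
```

A fluent combination shares many higher-order n-grams with the corpus and scores well above a scrambled one, which is how non-fluent join candidates are ranked down.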
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
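The JOIN procedure described above can be sketched over (start, end) token spans; this is an illustrative sketch only, omitting the joining of the corresponding translations and the LM-based fluency filtering:

```python
def join_subsegments(subsegs):
    """Greedy longest-first JOIN sketch over (start, end) token spans of
    subsegments found in one input segment."""
    def size(span):
        return span[1] - span[0]

    def joinable(a, b):
        # spans overlap (JOINO) or are immediate neighbours (JOINN)
        return a[0] <= b[1] + 1 and b[0] <= a[1] + 1

    seen = set(subsegs)
    pending = sorted(subsegs, key=size, reverse=True)
    results = []
    while pending:
        current = pending.pop(0)
        results.append(current)
        new = []
        for other in pending:
            if joinable(current, other):
                joined = (min(current[0], other[0]), max(current[1], other[1]))
                if joined not in seen:      # avoid re-deriving the same span
                    seen.add(joined)
                    new.append(joined)
        # newly built, longer subsegments are processed first
        pending = sorted(new, key=size, reverse=True) + pending
    return results

print(join_subsegments([(0, 3), (2, 5), (7, 8)]))
```

Here the two overlapping spans (0, 3) and (2, 5) are joined into the longer span (0, 5), which is then considered before the remaining shorter subsegments, mirroring the longest-first preference of the algorithm.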
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB     OJ      NJ      all
100%      0.41   0.12    0.10    0.17    0.46
95–99%    0.84   0.91    0.64    0.90    1.37
85–94%    0.07   0.05    0.25    0.76    0.81
75–84%    0.80   0.91    1.71    3.78    4.40
50–74%    8.16  10.05   25.09   40.95   42.58
any      10.28  12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08   0.07    0.11    0.28
95–99%    0.75   0.44    0.49    0.66
85–94%    0.05   0.08    0.49    0.61
75–84%    0.46   0.96    3.67    3.85
50–74%   10.24  27.77   41.90   44.47
all      11.58  29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15   0.13    0.29    0.57
95–99%    0.98   0.59    1.24    1.45
85–94%    0.09   0.22    1.34    1.37
75–84%    1.03   2.26    6.35    7.07
50–74%   12.15  34.84   49.82   51.62
all      14.40  38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1www.matecat.com
2The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there have been many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress
of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.
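The scoring and ranking step can be sketched as a word-level edit distance. This is an illustrative simplification (the function names are ours, and it omits TER's shift operation, which the real tercom tool handles):

```python
def edit_distance(hyp, ref):
    # word-level Levenshtein distance: insert/delete/substitute only
    # (full TER additionally allows block shifts)
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a TM word
                          d[i][j - 1] + 1,        # insert a test word
                          d[i - 1][j - 1] + sub)  # match / substitute
    return d[m][n]

def ter_like_score(tm_segment, test_segment):
    # lower = more similar; normalized by test-sentence length
    hyp, ref = tm_segment.split(), test_segment.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

def rank_tm_matches(test_segment, tm_segments, k=5):
    # return the k TM segments with the lowest edit rate
    return sorted(tm_segments, key=lambda s: ter_like_score(s, test_segment))[:k]
```

For the running example, "we want to have a table near the window" needs one deletion and three substitutions to become "we would like a table by the window", so it ranks first among the TM segments.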
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions, insertion, deletion, and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)
we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
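Turning the alignment tuples into this three-row display can be sketched as follows. The in-memory tuple format (TM word, input word, operation, shift) is our assumption, modelled on the example above rather than tercom's exact file format:

```python
def summarize_alignment(alignment):
    # alignment: list of (tm_word, input_word, op, shift) tuples.
    # A match ("C") is rendered as "|", edit operations keep their
    # letter, and an empty slot (e.g. the input side of a deletion)
    # is shown as "-".
    tm_words, marks, input_words = [], [], []
    for tm_word, input_word, op, _shift in alignment:
        tm_words.append(tm_word or "-")
        input_words.append(input_word or "-")
        marks.append("|" if op == "C" else op)
    return tm_words, marks, input_words

alignment = [
    ("we", "we", "C", 0), ("want", "", "D", 0), ("to", "would", "S", 0),
    ("have", "like", "S", 0), ("a", "a", "C", 0), ("table", "table", "C", 0),
    ("near", "by", "S", 0), ("the", "the", "C", 0), ("window", "window", "C", 0),
]
tm_words, marks, input_words = summarize_alignment(alignment)
```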
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
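This weighting scheme can be sketched as a weighted edit distance. The concrete cost values below are illustrative (the text only states that deletion gets a very low weight), and shifts are again left out for brevity:

```python
# Illustrative edit costs: deletions are cheap because deleting
# highlighted words is faster than writing new translations.
COSTS = {"ins": 1.0, "sub": 1.0, "del": 0.1}

def weighted_edit_score(tm_seg, test_seg):
    # weighted word-level edit distance from the TM segment to the test segment
    tm, ref = tm_seg.split(), test_seg.split()
    m, n = len(tm), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + COSTS["del"]   # drop a TM word
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + COSTS["ins"]   # add a test word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm[i - 1] == ref[j - 1] else COSTS["sub"]
            d[i][j] = min(d[i - 1][j] + COSTS["del"],
                          d[i][j - 1] + COSTS["ins"],
                          d[i - 1][j - 1] + sub)
    return d[m][n]
```

With these costs, TM segment 1 from the example scores about 0.6 (six cheap deletions) while TM segment 2 scores 3.0 (three expensive insertions), so TM segment 1 is preferred; with uniform costs the preference would be reversed.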
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
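Parsing this bracketed alignment notation can be sketched as below; the regular expression is an assumption about the layout shown above (each word followed by a parenthesised list of 1-based target positions):

```python
import re

def parse_giza_alignment(line):
    # returns (word, [target positions]) pairs; an empty position list
    # means the word is unaligned (empty alignment)
    return [(word, [int(p) for p in positions.split()])
            for word, positions in re.findall(r"(\S+) \(([\d ]*)\)", line)]

line = ("NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) "
        "table ( 5 ) near ( 3 ) the ( ) window ( 2 )")
pairs = dict(parse_giza_alignment(line))
```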
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red in the selected source sentences and their corresponding translations.
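Propagating the colors from the source side to the target side through the word alignment can be sketched as follows; the helper name and the boolean-flag representation are ours:

```python
def color_target(source_match_flags, alignment_points, target_len):
    # source_match_flags[i]: True if source word i matched the input
    # sentence (green). alignment_points: 0-based (source, target) links.
    # Target words linked only to unmatched source words, or not linked
    # at all, stay red.
    target_green = [False] * target_len
    for src, tgt in alignment_points:
        if source_match_flags[src]:
            target_green[tgt] = True
    return ["green" if g else "red" for g in target_green]
```

For example, with flags [True, False, True] over a three-word source, links [(0, 0), (2, 1)], and a three-word target, the result is ["green", "green", "red"].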
Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color but still not be good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
39
5 you rsquore overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.
We are currently working on different features to measure time and cognitive load in the post-editing process. Popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz
Abstract
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.
In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.
1 Introduction
Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.
Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly to savings by translation companies and their customers.
We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.
The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.
Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.
2 Related work
Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.
TM-related papers mainly focus on algorithms for searching, matching, and suggesting segments
CZ          EN          probabilities               alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses toolkit (Koehn et al., 2007) directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translation in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
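The structure of such an entry can be made concrete with a short sketch. This is our illustration, not the authors' code: the field layout mimics a Moses phrase-table line (`src ||| tgt ||| probabilities ||| alignment`), and the three alignment properties above are detected from the alignment points.

```python
# Hypothetical sketch: parse a Moses-style phrase-table entry into
# subsegment, translation, four probabilities and alignment points
# (as illustrated in Figure 1), then classify the alignment.

def parse_entry(line):
    """Split 'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' into fields."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    probs = [float(p) for p in probs.split()]
    points = [tuple(map(int, pt.split("-"))) for pt in align.split()]
    return src, tgt, probs, points

def alignment_properties(points, src_len, tgt_len):
    """Detect the three cases mentioned in the text."""
    src_covered = {s for s, _ in points}
    tgt_covered = {t for _, t in points}
    # 1) empty alignment: some word on either side is left unaligned
    empty = len(src_covered) < src_len or len(tgt_covered) < tgt_len
    # 2) one-to-many: some word participates in several alignment points
    one_to_many = len(points) > len(src_covered) or len(points) > len(tgt_covered)
    # 3) opposite orientation: word order is reversed on the target side
    opposite = any(s1 < s2 and t1 > t2
                   for s1, t1 in points for s2, t2 in points)
    return {"empty": empty, "one_to_many": one_to_many, "opposite": opposite}

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, probs, points = parse_entry(entry)
```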
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document.
• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output is OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output is NJ.
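A minimal sketch of the two JOIN conditions, under the assumption (ours, not stated in the paper) that subsegment occurrences are represented as token index spans within the input segment:

```python
# Illustrative sketch, not the authors' code: subsegment occurrences in a
# document segment are token index spans (start, end), end exclusive.

def can_join(span_a, span_b):
    """Return 'overlap' (JOINO), 'neighbour' (JOINN) or None."""
    a0, a1 = span_a
    b0, b1 = span_b
    if a0 < b1 and b0 < a1:      # the spans share at least one token
        return "overlap"
    if a1 == b0 or b1 == a0:     # the spans are directly adjacent
        return "neighbour"
    return None

# e.g. "be greater than" (0, 3) and "greater than zero" (1, 4) overlap,
# while "be greater" (0, 2) and "than zero" (2, 4) are neighbours.
```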
¹http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original segment:  "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:        "následujících kategorií" (the following categories)
new subsegment:    "lze rozdělit do následujících kategorií"
its translation:   "can be divided into the following categories"
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, …, 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
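Our reading of this loop can be condensed into the following sketch. The span representation, the join test, and the de-duplication via a seen set are our assumptions, not the authors' implementation:

```python
# Condensed sketch of the JOIN loop described above (our reconstruction).
# Subsegment occurrences are token index spans (start, end), end exclusive.

def join(a, b):
    """Join two spans if they overlap or neighbour; None otherwise."""
    (a0, a1), (b0, b1) = a, b
    if a0 <= b1 and b0 <= a1:                  # overlapping or adjacent
        return (min(a0, b0), max(a1, b1))
    return None

def join_method(subsegments):
    """Greedily combine spans, always preferring the longest first."""
    I = sorted(set(subsegments), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)           # avoid regenerating the same span twice
    results = []
    while I:
        current = I.pop(0)  # longest remaining subsegment; it is discarded
        T = []              # temporary list of newly joined subsegments
        for other in I:
            joined = join(current, other)
            if joined is not None and joined not in seen:
                seen.add(joined)
                T.append(joined)
        results.extend(T)
        # prepend the new (longer) spans so they are processed first
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I
    return results
```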
²In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech–English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.
For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95–99%    0.84   0.91   0.64   0.90   1.37
85–94%    0.07   0.05   0.25   0.76   0.81
75–84%    0.80   0.91   1.71   3.78   4.40
50–74%    8.16  10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95–99%    0.75   0.44   0.49   0.66
85–94%    0.05   0.08   0.49   0.61
75–84%    0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95–99%    0.98   0.59   1.24   1.45
85–94%    0.09   0.22   1.34   1.37
75–84%    1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated both by MemoQ Fragment assembly and by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

³http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB   OJ    NJ    NS
prec      0.60  0.63  0.70  0.66
recall    0.67  0.74  0.74  0.71
F1        0.64  0.68  0.72  0.68
METEOR    0.31  0.37  0.38  0.38

DGT
prec      0.76  0.93  0.91  0.81
recall    0.78  0.86  0.88  0.85
F1        0.77  0.89  0.89  0.83
METEOR    0.40  0.50  0.51  0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language-pair independent and allows users to upload their own memories in
¹www.matecat.com
²The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source–target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT–TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations.⁴ Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
³The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.
⁴Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
⁵http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |     |        |
we   -      would   like   a   table   by     the   window   .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
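The argument can be reproduced with a small weighted edit distance. The weights below (0.1 for deletion, 1.0 otherwise) are invented for illustration and are not the exact values used in CATaLog:

```python
# Illustrative weighted edit distance (assumed weights): deletions from
# the TM segment are made much cheaper than insertions and substitutions.

def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Dynamic-programming cost of turning `tm` into `test` (word lists)."""
    n, m = len(test), len(tm)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_ins                     # insert every test word
    for j in range(1, m + 1):
        d[0][j] = j * w_del                     # delete every TM word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,      # insert a test word
                          d[i][j - 1] + w_del,      # delete a TM word
                          d[i - 1][j - 1] + sub)    # match / substitute
    return d[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
```

With cheap deletions, TM segment 1 wins (only deletions are needed), whereas with equal weights TM segment 2 would win, exactly as discussed above.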
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
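A sketch of how the two alignment sets could be combined into colors, using the English–Bengali example above; the data structures and the projection step are our reconstruction, not the tool's code:

```python
# Our reconstruction of the color-coding step: TER tells us which TM-source
# tokens match the input (op 'C'), and the source-target word alignment
# projects that status onto the TM-target tokens.

def color_target(ter_ops, src_to_tgt, tgt_len):
    """ter_ops[i] is the TER edit op for TM-source token i ('C' = match);
    src_to_tgt maps a source token index to aligned target indices."""
    colors = ["red"] * tgt_len          # unaligned or edited -> red
    for i, op in enumerate(ter_ops):
        if op == "C":                   # matched source word
            for j in src_to_tgt.get(i, []):
                colors[j] = "green"
    return colors

# TM source: "we want to have a table near the window ."
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# 0-indexed version of the GIZA++ alignment in the example above,
# into the 7-token Bengali translation
src_to_tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
colors = color_target(ter_ops, src_to_tgt, tgt_len=7)
```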
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change i paid eighty dollars
2 i think you 've got the wrong number
3 you are wrong
4 you pay me
5 you 're overcharging me
Target Matches
1 আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2 আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)
3 আপনি ভুল (Gloss: apni vul)
4 আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)
5 আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the color-coded top 5 TM matches listed above, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time over all the reference sentences (i.e. source TM segments). For improving the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
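The posting-list lookup can be sketched as a small inverted index; the stop-word list and TM contents below are invented for illustration:

```python
# Minimal inverted-index sketch of the posting-list lookup described above
# (stop words and TM sentences are invented for illustration).

STOP = {"the", "a", "to", "of", "is"}

def build_index(tm_sources):
    """Map each lowercased non-stop word to the set of TM sentence ids."""
    index = {}
    for sid, sent in enumerate(tm_sources):
        for word in set(sent.lower().split()) - STOP:
            index.setdefault(word, set()).add(sid)
    return index

def candidates(query, index):
    """TM sentences sharing at least one vocabulary word with the query;
    only these are passed on to the (expensive) TER comparison."""
    ids = set()
    for word in set(query.lower().split()) - STOP:
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
index = build_index(tm)
```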
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa Vít 31
Barbu Eduard 9
Chatzitheodorou Konstantinos 24
Horák Aleš 31
Medveď Marek 31
Mitkov Ruslan 17
Naskar Sudip Kumar 36
Nayek Tapas 36
Pal Santanu 36
Parra Escartín Carla 1
Raisa Timonera Katerina 17
van Genabith Josef 36
Vela Mihaela 36
Zampieri Marcos 36
CZ         EN          probabilities                 alignment
být větší  be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148       0-0 1-1
Figure 1: An example of generated subsegments – consistent phrases
within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.
In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.
(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.
3 Subsegment generation
In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.
The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities, and alignment points; see Figure 1 for an example.
The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability, and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
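For illustration, one way to select among multiple candidate translations is by the product of the four scores; the probabilities are those shown in Figure 1, while the product combination rule is an assumption of this sketch (the paper does not spell out the selection formula):

```python
# Sketch: pick the best of several candidate translations for a subsegment
# by the product of its four Moses scores (inverse/direct phrase translation
# probability and lexical weighting, as in Figure 1).
candidates = {
    "be greater": (0.538, 0.053, 0.538, 0.136),
    "be larger":  (0.170, 0.054, 0.019, 0.148),
}

def best_translation(cands):
    def product(ps):
        out = 1.0
        for p in ps:
            out *= p
        return out
    return max(cands, key=lambda t: product(cands[t]))

print(best_translation(candidates))  # be greater
```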
The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
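A minimal sketch of reading such alignment points and flagging the three cases above (illustrative code, not the authors' implementation):

```python
# Sketch: parse Moses-style alignment points such as "0-0 1-1" and detect
# empty alignments, one-to-many alignments, and opposite orientation.
def parse_alignment(points, src_len, tgt_len):
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    info = {
        # 1) empty alignment: a word on either side has no counterpart
        "empty": len(aligned_src) < src_len or len(aligned_tgt) < tgt_len,
        # 2) one-to-many: one source word maps to several target words
        "one_to_many": len(pairs) > len(aligned_src),
        # 3) opposite orientation: target order reverses the source order
        "reordered": pairs != sorted(pairs, key=lambda p: p[1]),
    }
    return pairs, info

pairs, info = parse_alignment("0-0 1-1", src_len=2, tgt_len=2)
print(pairs)  # [(0, 0), (1, 1)]
print(info)   # all three flags are False for this monotone alignment
```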
4 Subsegment combination
The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:
• JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ.
1http://www.statmt.org/moses?n=FactoredTraining.AlignWords
Table 1: An example of the SUBSTITUTEO method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = ⟨1, 5⟩) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
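A condensed sketch of this greedy joining loop, with subsegments represented as (start, end) token spans; the temporary-list bookkeeping of the full algorithm is simplified here:

```python
# Simplified sketch of the greedy JOIN loop: process spans longest-first and
# join two spans when they overlap (JOINO) or neighbour each other (JOINN).
def join_spans(spans):
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    while len(work) > 1:
        head, rest = work[0], list(work[1:])
        for other in rest:
            lo, hi = min(head, other), max(head, other)
            if lo[1] >= hi[0]:  # spans overlap or share a boundary
                rest.remove(other)
                joined = (lo[0], max(lo[1], hi[1]))
                work = sorted([joined] + rest,
                              key=lambda s: s[1] - s[0], reverse=True)
                break
        else:
            return work  # the longest span cannot be joined with anything
    return work

# spans covering tokens 0-3, 3-5 and 7-8 of a segment
print(join_spans([(0, 3), (3, 5), (7, 8)]))  # [(0, 5), (7, 8)]
```

The neighbouring spans (0, 3) and (3, 5) merge into (0, 5), while (7, 8) stays separate because a gap remains.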
2In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech–English translation pairs filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013) with a size of over 300,000 translation pairs.
For measuring coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62
Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the translation of a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3,
Mihaela Vela2, Josef van Genabith2,3
1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in
1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments, and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
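This ranking step can be sketched with a plain word-level edit rate standing in for tercom's TER (which additionally handles shifts); the toy TM below is illustrative, not the BTEC data:

```python
# Sketch: rank TM source segments by a normalized word-level edit rate
# (Levenshtein distance / reference length) and keep the k best matches.
def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[len(h)][len(r)] / max(len(r), 1)

def top_matches(input_seg, tm_sources, k=5):
    return sorted(tm_sources, key=lambda s: edit_rate(input_seg, s))[:k]

tm = ["we want to have a table near the window", "where is the bank"]
print(top_matches("we would like a table by the window", tm, k=1))
```

A lower edit rate means a closer match, so sorting ascending puts the cheapest segment to post-edit first.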
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
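A sketch of this color coding, using difflib to approximate the matched/unmatched spans; the actual tool derives these spans from tercom alignments rather than from difflib:

```python
# Sketch: mark each token of a TM match green (matches the input in order)
# or red (needs post-editing), approximating the tool's color coding.
import difflib

def color_code(input_seg, tm_match):
    src, tgt = input_seg.split(), tm_match.split()
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    out = []
    for op, _i1, _i2, j1, j2 in sm.get_opcodes():
        color = "green" if op == "equal" else "red"
        out.extend((word, color) for word in tgt[j1:j2])
    return out

print(color_code("we would like a table by the window",
                 "we want to have a table near the window"))
```

Tokens such as "want", "to", "have", and "near" come out red: these are exactly the portions a post-editor would need to change.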
3.1 Finding Similarity
For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window .
TM Match: we want to have a table near the window .
TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)
we want to    have a table near the window .
|  D    S     S    | |     S    |   |      |
we -    would like a table by   the window .
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment
2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
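The effect of the weighting can be sketched as follows. The concrete weight values are hypothetical, since the paper only states that deletions receive a very low weight and the other operations equal weights (shifts are omitted here for brevity):

```python
# Hypothetical edit weights: matches are free, deletions are cheap,
# insertions and substitutions carry the full cost.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(edit_ops):
    """Sum weighted edit operations from a TER alignment op string."""
    return sum(WEIGHTS[op] for op in edit_ops)

# Test segment: "how much does it cost"
# TM segment 1 matches all 5 words but needs 6 deletions
# ("to the holiday inn by taxi"):
cost1 = weighted_cost("CCCCC" + "DDDDDD")
# TM segment 2 matches 2 words but needs 3 insertions ("does it cost"):
cost2 = weighted_cost("CC" + "III")
# With uniform weights segment 2 would win (3 edits vs. 6);
# with a low deletion weight segment 1 is preferred (0.6 vs. 3.0).
```

This reproduces the ranking argued for in the text: the segment requiring only deletions is ranked above the one requiring insertions.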
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই ।

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
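A small parsing sketch for alignment lines of this kind, assuming the `({ ... })` bracket convention of GIZA++ A3 output (the tool's actual parsing code is not described):

```python
import re

def parse_giza_alignment(line):
    """Parse a GIZA++ A3-style alignment line such as
    'NULL ({ }) we ({ 1 }) want ({ 6 }) ...' into a list of
    (source_word, [target_positions]) pairs."""
    pairs = []
    for word, positions in re.findall(r'(\S+) \(\{([\d ]*)\}\)', line):
        pairs.append((word, [int(p) for p in positions.split()]))
    return pairs

line = "NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 })"
# Each source word is mapped to the (1-based) target positions it links to.
```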
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
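A minimal sketch of how the two alignment sets can be combined; the data structures are hypothetical, as the tool's internal representation is not described. Target tokens linked only to TER-matched source tokens are colored green, everything else red:

```python
def color_target(matched_src_positions, src_tgt_alignment, tgt_len):
    """Mark target tokens green when they are aligned only to TER-matched
    source tokens, red otherwise (a simplified view of the scheme).

    matched_src_positions: set of 0-based source positions matched by TER.
    src_tgt_alignment: iterable of 0-based (source, target) word links.
    """
    matched_tgt, unmatched_tgt = set(), set()
    for s, t in src_tgt_alignment:
        if s in matched_src_positions:
            matched_tgt.add(t)
        else:
            unmatched_tgt.add(t)
    colors = []
    for t in range(tgt_len):
        if t in matched_tgt and t not in unmatched_tgt:
            colors.append("green")
        else:
            colors.append("red")   # unmatched or unaligned target token
    return colors
```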
Color coding the TM source sentences makes explicit which portions of matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change . i paid eighty dollars .

2. i think you 've got the wrong number

3. you are wrong

4. you pay me
5. you 're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen . ami ashi dollar diyechi .)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen .)

3. আপিন ভল (Gloss: apni vul .)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din .)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen .)
For the input sentence shown above, the TM system shows the color coded 5 topmost TM matches given above, in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature allows comparing results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
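The posting-list mechanism can be sketched as follows; the stop-word list here is illustrative only, since the actual list used by the tool is not specified:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "in", "is"}  # illustrative list

def build_posting_lists(tm_sentences):
    """Map each (lowercased, unstemmed) vocabulary word to the ids of the
    TM sentences containing it, skipping stop words."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sentences):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_sentences(postings, query):
    """Only TM sentences sharing at least one vocabulary word with the
    input sentence are scored, shrinking the TER search space."""
    candidates = set()
    for word in query.lower().split():
        candidates |= postings.get(word, set())
    return candidates
```

Only the candidate set returned by `candidate_sentences` is then passed to the TER-based similarity computation.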
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.
Figure 1: Screenshot with color coding scheme.
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodoroou, Konstantinos, 24
Horak, Ales, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Table 1: An example of the SUBSTITUTE_O method, Czech → English.

original: • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments: • "následujících kategorií" (the following categories)
new subsegment: "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories
• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTE_O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTE_N: the second subsegment is inserted into the gap in the first segment; the output is NS.
During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2
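One way to realize such a combined n-gram score is sketched below. This is an assumption on our part: the paper does not specify how the per-order scores are combined, and `lm_logprob` stands in for a KenLM-style log-probability query.

```python
def combined_ngram_score(sentence, lm_logprob, max_n=5):
    """Average the log-probabilities of all n-grams of orders 1..max_n
    into a single fluency score (higher is more fluent).

    lm_logprob(ngram_tuple) is a hypothetical callback standing in for a
    language-model query such as KenLM's."""
    tokens = sentence.split()
    total, count = 0.0, 0
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            total += lm_logprob(tuple(tokens[i:i + n]))
            count += 1
    return total / count if count else float("-inf")
```

Candidate combinations would then be ranked by this score, keeping the most fluent joins.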
The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
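The joining step can be sketched over token index spans. This is a simplified greedy variant of the loop described above, under our own representation: subsegments as hypothetical (start, end) index pairs, with overlap corresponding to the overlapping JOIN and adjacency to the non-overlapping one.

```python
def try_join(span_a, span_b):
    """Join two subsegment spans (start, end token indexes, end exclusive)
    when they overlap or are adjacent; return the merged span or None."""
    a, b = sorted([span_a, span_b])
    if b[0] <= a[1]:                   # overlap or adjacency
        return (a[0], max(a[1], b[1]))
    return None

def join_subsegments(spans):
    """Greedy sketch of the JOIN loop: repeatedly merge the longest span
    with any joinable partner, restarting from each newly created span,
    so longer subsegments are preferred."""
    pending = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while pending:
        current = pending.pop(0)
        for i, other in enumerate(pending):
            merged = try_join(current, other)
            if merged is not None:
                pending.pop(i)
                pending.insert(0, merged)   # new, longer span goes first
                break
        else:
            result.append(current)          # nothing joinable remains
    return result
```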
2In current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
5 Evaluation
For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with over 300,000 translation pairs.
To measure the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The SUB column displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM, and translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16  10.05  25.09  40.95  42.58
any     10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%  10.24  27.77  41.90  44.47
all     11.58  29.32  46.66  49.87
For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%  12.15  34.84  49.82  51.62
all     14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. To measure the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into ac-
3http://kilgray.com/products/memoq
Table 5: METEOR, 100% matches.

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45
Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: using the generated TMs in the process of translating a set of documents and measuring the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The tenten corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools. page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows
it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in
1www.matecat.com
2The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is more suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, which should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description
We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress
of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) expressing how much editing is required to convert one sentence into another, normalized by the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.
5http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window
TM Match: we want to have a table near the window
TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)
we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
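The alignment above can be reproduced with a standard dynamic-programming edit-distance traceback. The sketch below is illustrative only: it implements plain word-level Levenshtein alignment producing C/S/I/D labels and omits TER's shift operation, which the tercom package handles on top of this.

```python
def edit_alignment(hyp, ref):
    """Word-level edit-distance alignment between a TM match (hyp) and an
    input sentence (ref); a simplified stand-in for TER without shifts."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # trace back to recover the operation sequence over the TM-match words
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('C' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('D')  # TM-match word not needed for the input
            i -= 1
        else:
            ops.append('I')  # input word with no TM-match counterpart
            j -= 1
    return ops[::-1]

match = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
print(edit_alignment(match, inp))  # → ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
```

This reproduces the C/D/S pattern shown above for the window/table example.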
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost
• TM segment 1: how much does it cost to the holiday inn by taxi
• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
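The effect of non-uniform edit costs on ranking can be sketched as follows. The weight values here are hypothetical placeholders: the paper only states that deletions receive a very low weight and the other operations equal weights.

```python
# Hypothetical edit-operation weights; 'H' stands for shift.
CHEAP_DELETE = {'C': 0.0, 'D': 0.1, 'I': 1.0, 'S': 1.0, 'H': 1.0}
UNIFORM_TER = {'C': 0.0, 'D': 1.0, 'I': 1.0, 'S': 1.0, 'H': 1.0}


def edit_cost(ops, weights):
    """Total weighted cost of an edit-operation sequence (lower is better)."""
    return sum(weights[op] for op in ops)


# Test segment: "how much does it cost"
# TM segment 1: 5 matching words plus 6 deletions ("to the holiday inn by taxi")
# TM segment 2: 2 matching words plus 3 insertions ("does it cost")
seg1 = ['C'] * 5 + ['D'] * 6
seg2 = ['C'] * 2 + ['I'] * 3

# Uniform TER weights prefer segment 2 (3 edits vs 6 edits):
print(edit_cost(seg1, UNIFORM_TER) > edit_cost(seg2, UNIFORM_TER))   # → True
# Discounted deletions prefer segment 1, matching the argument above:
print(edit_cost(seg1, CHEAP_DELETE) < edit_cost(seg2, CHEAP_DELETE))  # → True
```

With cheap deletions, the long-but-covering match wins the ranking, as the example in the text argues it should.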
3.2 Color Coding
Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কােছ একটা েটিবল চাই
• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
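A minimal sketch of how the two alignment sets can be combined into the target-side coloring. The helper names and the 0-based link pairs are our own illustration, not CATaLog's actual implementation; here a target word is colored green only when every source word aligned to it is a TER match.

```python
def matched_source_indices(ops):
    """0-based indices of TM-source tokens labelled 'C' by the TER
    alignment; 'I' consumes no TM-source token, 'C'/'S'/'D' consume one."""
    matched, pos = set(), 0
    for op in ops:
        if op == 'I':
            continue
        if op == 'C':
            matched.add(pos)
        pos += 1
    return matched


def color_target(matched_src, links, tgt_len):
    """Green for target tokens whose aligned source tokens are all matches."""
    colors = []
    for t in range(tgt_len):
        srcs = [s for s, tt in links if tt == t]
        ok = bool(srcs) and all(s in matched_src for s in srcs)
        colors.append('green' if ok else 'red')
    return colors


# TER ops for the TM source "we want to have a table near the window"
ops = ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
matched = matched_source_indices(ops)  # {0, 4, 5, 7, 8}: we, a, table, the, window
# Hypothetical 0-based source->target links in the style of the GIZA++
# example above, for a 6-token Bengali target.
links = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1)]
print(color_target(matched, links, 6))
# → ['green', 'green', 'red', 'green', 'green', 'red']
```

Target words aligned to the deleted "want" or substituted "near" come out red, mirroring the coloring described in the text.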
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1. you gave me the wrong change. i paid eighty dollars.
2. i think you've got the wrong number.
3. you are wrong.
4. you pay me.
5. you're overcharging me.
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)
3. আপিন ভল (Gloss: apni vul.)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency
Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) takes a lot of time. To improve the search efficiency we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentencesfor similarity measurement which contain one
or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature allows comparing results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce it will be significantly different.
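The posting-list filtering described above can be sketched as follows. The stop-word list is an illustrative placeholder, and, as in the paper, words are lowercased but not stemmed.

```python
from collections import defaultdict

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'is', 'we'}


def build_index(tm_sources):
    """Inverted index: surface-form word -> posting list of segment ids."""
    index = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index


def candidates(index, input_segment):
    """TM segments sharing at least one vocabulary word with the input;
    only these candidates are then scored with TER."""
    ids = set()
    for word in input_segment.lower().split():
        ids |= index.get(word, set())
    return ids


tm = ["we would like a table by the window",
      "how much does it cost",
      "good morning"]
idx = build_index(tm)
print(sorted(candidates(idx, "a table near the window")))  # → [0]
```

Only segment 0 shares a content word ("table", "window") with the input, so TER alignment is computed for it alone instead of for the whole memory.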
3.4 Batch Translation
The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, popular keystroke logging among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.
Acknowledgements
We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References
Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.
Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.
Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.
J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.
Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).
Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.
Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.
Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.
Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.
Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.
Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.
Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.
Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.
Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.
Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.
Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.
Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.
Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.
Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.
Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.
Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Table 2: MemoQ analysis, TM coverage in %
Match    TM     SUB    OJ     NJ     all
100      0.41   0.12   0.10   0.17   0.46
95-99    0.84   0.91   0.64   0.90   1.37
85-94    0.07   0.05   0.25   0.76   0.81
75-84    0.80   0.91   1.71   3.78   4.40
50-74    8.16  10.05  25.09  40.95  42.58
any     10.28  12.04  27.79  46.56  49.62
Table 3: MemoQ analysis, DGT-TM
Match    SUB    OJ     NJ     all
100      0.08   0.07   0.11   0.28
95-99    0.75   0.44   0.49   0.66
85-94    0.05   0.08   0.49   0.61
75-84    0.46   0.96   3.67   3.85
50-74   10.24  27.77  41.90  44.47
all     11.58  29.32  46.66  49.87
For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.
Table 4: MemoQ analysis, TM + DGT-TM
Match    SUB    OJ     NJ     all
100      0.15   0.13   0.29   0.57
95-99    0.98   0.59   1.24   1.45
85-94    0.09   0.22   1.34   1.37
75-84    1.03   2.26   6.35   7.07
50-74   12.15  34.84  49.82  51.62
all     14.40  38.04  59.04  61.51
We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system3. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations into the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches
company in-house translation memory
feature     SUB    OJ     NJ     NS
prec        0.60   0.63   0.70   0.66
recall      0.67   0.74   0.74   0.71
F1          0.64   0.68   0.72   0.68
METEOR      0.31   0.37   0.38   0.38
DGT
prec        0.76   0.93   0.91   0.81
recall      0.78   0.86   0.88   0.85
F1          0.77   0.89   0.89   0.83
METEOR      0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of a language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.
References
Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proceedings of Translating and the Computer, 30:27–28.
Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.
Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.
Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.
Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.
Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.
Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.
Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.
Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract
This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 Introduction
The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.
The core component of the vast majorityof CAT tools are translation memories (TM)TMs work under the assumption that previ-ously translated segments can serve as goodmodels for new translations specially whentranslating technical or domain specific textsTranslators input new texts into the CAT tooland these texts are divided into shorter seg-ments The TM engine then checks whetherthere are segments in the memory which areas similar as possible to those from the inputtext Every time the software finds a simi-lar segment in the memory the tool shows
it to the translator as a suitable suggestionusually through a graphical interface In thisscenario translators work as post-editors bycorrecting retrieved segments suggested by theCAT tool or translating new segments fromscratch This process is done iteratively andevery new translation increases the size of thetranslation memory making it both more use-ful and more helpful to future translations
Although in the first place it might soundvery simplistic the process of matching sourceand target segments and retrieving translatedsegments from the TM is far from trivial Toimprove the retrieval engines researchers havebeen working on different ways of incorporat-ing semantic knowledge such as paraphras-ing (Utiyama et al 2011 Gupta and Orăsan2014 Gupta et al 2015) as well as syntax(Clark 2002 Gotti et al 2005) in this pro-cess Another recent direction that researchin CAT tools is taking is the integration ofboth TM and machine translation (MT) out-put (He et al 2010 Kanavos and Kartsak-lis 2010) With the improvement of state-of-the-art MT systems MT output is no longerused just for gisting it is now being used inreal-world translation projects Taking advan-tage of these improvements CAT tools such asMateCat1 have been integrating MT outputalong TMs in the list of suitable suggestions(Cettolo et al 2013)
In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1 5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
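The tercom package produces these alignments directly; purely as an illustration of how such per-word labels arise, the sketch below derives C/S/I/D labels with a standard word-level Levenshtein alignment. It omits TER's shift operation and is not the tercom implementation, just a minimal model of the same idea:

```python
def align(hyp, ref):
    """Word-level Levenshtein alignment between two token lists.

    Returns labels: C (match), S (substitution), D (word in hyp only),
    I (word in ref only). Full TER additionally allows shifts.
    """
    m, n = len(hyp), len(ref)
    # cost[i][j] = edits needed to turn hyp[:i] into ref[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i
    for j in range(n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back, preferring the diagonal, to recover the label sequence.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('C' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('D')   # delete a word of the TM match
            i -= 1
        else:
            ops.append('I')   # insert a word of the input
            j -= 1
    return ops[::-1]

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
print(''.join(align(tm, inp)))  # -> CDSSCCSCC
```

Running it on the example above reproduces the label sequence of the TER alignment: one deletion ("want"), three substitutions ("to"/"would", "have"/"like", "near"/"by"), and matches elsewhere.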
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
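To see the effect of the weighting, the sketch below scores the two TM segments above under both uniform and non-uniform edit costs. The concrete weights (0.1 for deletion, 1.0 for the rest) are our own illustrative assumption; the paper only states that deletions receive a very low weight, and the operation sequences are written out by hand:

```python
# Hypothetical operation weights: 'C' (match) is free, deletions cheap,
# insertions and substitutions expensive. 0.1 vs 1.0 is our own guess.
WEIGHTS = {'C': 0.0, 'D': 0.1, 'I': 1.0, 'S': 1.0}

def weighted_cost(ops):
    """Weighted cost of a TER-style edit sequence (lower = better match)."""
    return sum(WEIGHTS[op] for op in ops)

def uniform_cost(ops):
    """Plain edit count, as in unweighted TER."""
    return sum(op != 'C' for op in ops)

# Test segment: "how much does it cost"
# TM segment 1: "how much does it cost to the holiday inn by taxi"
tm1 = list('CCCCCDDDDDD')   # 5 matches, 6 deletions
# TM segment 2: "how much"
tm2 = list('CCIII')         # 2 matches, 3 insertions

print(uniform_cost(tm1), uniform_cost(tm2))     # 6 3  -> TM segment 2 wins
print(weighted_cost(tm1) < weighted_cost(tm2))  # True -> TM segment 1 wins
```

Under uniform costs, TM segment 2 looks better (3 edits vs. 6), but with the cheap deletion weight TM segment 1 is correctly preferred.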
3.2 Color Coding

From the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই।

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments green and the unmatched fragments red in the selected source sentences and their corresponding translations.
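As an illustration of this projection step, the sketch below colors target tokens by mapping TER match labels through the source-target word alignment. The criterion used for green (every source token linked to the target token is a TER match) is our simplifying assumption; the paper does not spell out CATaLog's exact rule:

```python
def color_target(src_labels, alignment, tgt_len):
    """Project TER match labels from source to target via word alignment.

    src_labels: TER label per TM-source token ('C' = match, else an edit)
    alignment:  set of (src_idx, tgt_idx) links, e.g. from GIZA++
    A target token is green only if it is aligned and every source token
    linked to it is a match; otherwise it is red.
    """
    colors = []
    for t in range(tgt_len):
        links = [s for s, tt in alignment if tt == t]
        ok = bool(links) and all(src_labels[s] == 'C' for s in links)
        colors.append('green' if ok else 'red')
    return colors

# TM source "we want to have a table near the window" with TER labels
# against the input "we would like a table by the window".
labels = list('CDSSCCSCC')
# 0-indexed version of the GIZA++ links shown above (period link dropped).
links = {(0, 0), (8, 1), (6, 2), (4, 3), (5, 4), (1, 5)}
print(color_target(labels, links, 6))
# -> ['green', 'green', 'red', 'green', 'green', 'red']
```

In the output, the Bengali tokens aligned to matched English words ("we", "window", "a", "table") come out green, while those aligned to edited words ("near", "want") come out red.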
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown among the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number
Source Matches
1 you gave me the wrong change. i paid eighty dollars
2 i think you've got the wrong number
3 you are wrong
4 you pay me
5 you're overcharging me
Target Matches
1 আপনি আমাকে ভুল খুচরো দিয়েছেন। আমি আশি ডলার দিয়েছি। (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)
2 আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন। (Gloss: amar dharona apni vul nombore phon korechen.)
3 আপনি ভুল। (Gloss: apni vul.)
4 আপনি আমাকে টাকা দিন। (Gloss: apni amake taka din.)
5 আপনি আমার কাছে থেকে বেশি নিচ্ছেন। (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible, and in that case determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
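The posting-list lookup described above can be sketched in a few lines of Python. The stop-word list, tokenization, and toy data below are illustrative assumptions rather than CATaLog's actual configuration:

```python
from collections import defaultdict

STOP = {'the', 'a', 'to', 'of', 'is', 'in'}  # illustrative stop list

def build_index(tm_sources):
    """Posting lists: surface word -> ids of TM segments containing it.
    No stemming, mirroring the paper: only exact surface forms match."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:
                index[w].add(sid)
    return index

def candidates(index, query):
    """TM segments sharing at least one vocabulary word with the input;
    only these are passed on to the (expensive) TER comparison."""
    ids = set()
    for w in query.lower().split():
        ids |= index.get(w, set())
    return ids

tm = ["how much does it cost",
      "you gave me the wrong change",
      "we want to have a table near the window"]
idx = build_index(tm)
print(sorted(candidates(idx, "i think you have got the wrong number")))
# -> [1, 2]
```

Only segments 1 and 2 share a vocabulary word with the query ("you"/"wrong" and "have", respectively), so segment 0 is never scored with TER.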
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index
Baisa Vít 31
Barbu Eduard 9
Chatzitheodorou Konstantinos 24
Horák Aleš 31
Medveď Marek 31
Mitkov Ruslan 17
Naskar Sudip Kumar 36
Nayek Tapas 36
Pal Santanu 36
Parra Escartín Carla 1
Raisa Timonera Katerina 17
van Genabith Josef 36
Vela Mihaela 36
Zampieri Marcos 36
We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
6 Conclusion
We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.
We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.
1 IntroductionThe use of translation and text processing soft-ware is an important part of the translationworkflow Terminology and computer-aidedtranslation tools (CAT) are among the mostwidely used software that professional trans-lators use on a regular basis to increase theirproductivity and also improve consistency intranslation
The core component of the vast majorityof CAT tools are translation memories (TM)TMs work under the assumption that previ-ously translated segments can serve as goodmodels for new translations specially whentranslating technical or domain specific textsTranslators input new texts into the CAT tooland these texts are divided into shorter seg-ments The TM engine then checks whetherthere are segments in the memory which areas similar as possible to those from the inputtext Every time the software finds a simi-lar segment in the memory the tool shows
it to the translator as a suitable suggestionusually through a graphical interface In thisscenario translators work as post-editors bycorrecting retrieved segments suggested by theCAT tool or translating new segments fromscratch This process is done iteratively andevery new translation increases the size of thetranslation memory making it both more use-ful and more helpful to future translations
Although in the first place it might soundvery simplistic the process of matching sourceand target segments and retrieving translatedsegments from the TM is far from trivial Toimprove the retrieval engines researchers havebeen working on different ways of incorporat-ing semantic knowledge such as paraphras-ing (Utiyama et al 2011 Gupta and Orăsan2014 Gupta et al 2015) as well as syntax(Clark 2002 Gotti et al 2005) in this pro-cess Another recent direction that researchin CAT tools is taking is the integration ofboth TM and machine translation (MT) out-put (He et al 2010 Kanavos and Kartsak-lis 2010) With the improvement of state-of-the-art MT systems MT output is no longerused just for gisting it is now being used inreal-world translation projects Taking advan-tage of these improvements CAT tools such asMateCat1 have been integrating MT outputalong TMs in the list of suitable suggestions(Cettolo et al 2013)
In this paper we are concerned both withretrieval and with the post-editing interface ofTMs We present a new CAT tool called CAT-aLog2 which is language pair independent andallows users to upload their own memories in
1wwwmatecatcom2The tool will be released as a freeware open-
source software For more information use the follow-ing URL httpttguni-saarlanddesoftwarecatalog
36
the tool Examples showing the basic func-tionalities of CATaLog are presented using En-glish - Bengali data
2 Related Work
CAT tools have become very popular in thetranslation and localization industries in thelast two decades They are used by many lan-guage service providers freelance translatorsto improve translation quality and to increasetranslatorrsquos productivity (Lagoudaki 2008)Although the work presented in this paper fo-cuses on TM it should also be noted thatthere were many studies on MT post-editingpublished in the last few years (Specia 2011Green et al 2013 Green 2014) and as men-tioned in the last section one of the recenttrends is the development of hybrid systemsthat are able to combine MT with TM outputTherefore work on MT post-editing presentssignificant overlap with state-of-the-art CATtools and to what we propose in this paper
Substantial work have also been carried outon improving translation recommendation sys-tems which recommends post-editors either touse TM output or MT output (He et al 2010)To achieve good performance with this kind ofsystems researchers typically train a binaryclassifier (eg Support Vector Machines) todecide which output (TM or MT) is most suit-able to use for post-editing Work on integrat-ing MT with TM has also been done to makeTM output more suitable for post-editingdiminishing translatorsrsquo effort (Kanavos andKartsaklis 2010) Another study presenteda Dynamic Translation Memory which iden-tifies the longest common subsequence in thethe closest matching source segment identifiesthe corresponding subsequence in its transla-tion and dynamically adds this source-targetphrase pair to the phrase table of a phrase-based ststistical MT (PB-SMT) system (Biccediliciand Dymetman 2008)
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC (Basic Travel Expression Corpus) corpus[3] and their Bengali translations.[4] Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see instantly how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

[3] The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.
[4] Work in progress.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) expressing how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insertion, deletion, substitution, and shift. We use the TER metric (using tercom-7.25[5]) to find the edit rate between a test sentence and the TM reference sentences.
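The ranking step can be sketched as follows. This is an illustrative simplification, not the tercom implementation used in the tool: real TER also allows shift operations, while the `edit_rate` helper below (our own naming) counts only insertions, deletions, and substitutions via word-level Levenshtein distance.

```python
def edit_rate(hyp, ref):
    """Word-level edit distance between two token lists, divided by len(ref).

    Approximation of TER without the shift operation."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n] / max(n, 1)

def top_matches(test, tm_sources, k=5):
    """Return the k TM source segments most similar to the test segment:
    lower edit rate means higher similarity."""
    scored = [(edit_rate(test.split(), s.split()), s) for s in tm_sources]
    scored.sort(key=lambda x: x[0])
    return [s for _, s in scored[:k]]
```

For the running example, `top_matches("we would like a table by the window", tm_sources)` ranks "we want to have a table near the window" above unrelated segments, since it needs only one deletion and three substitutions.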
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions for implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
[5] http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window .
TM match: we want to have a table near the window .
TER alignment: ["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
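Given such an alignment, deriving the color classes for the TM-match tokens is straightforward. The helper below is our own illustration, not CATaLog code; the operation string transcribes the alignment shown above (insertions have no TM-side token and are therefore omitted from this view).

```python
def color_tm_tokens(tm_tokens, ops):
    """Assign a color to each TM-match token from a TER-style op sequence:
    'C' (match) -> green, 'S' (substitute) / 'D' (delete) -> red."""
    assert len(tm_tokens) == len(ops)
    return [(tok, "green" if op == "C" else "red")
            for tok, op in zip(tm_tokens, ops)]

tm = "we want to have a table near the window .".split()
ops = list("CDSSCCSCCC")  # transcribed from the alignment above
colored = color_tm_tokens(tm, ops)
```

Here `colored` marks "want", "to", "have", and "near" red and the remaining six tokens green, which is exactly the highlighting the translator sees.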
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. In post-editing, however, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost
• TM segment 1: how much does it cost to the holiday inn by taxi
• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations for 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
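The effect of operation-specific weights can be sketched as follows. The paper specifies only that deletions receive a very low weight and the other operations equal weights; the concrete values (0.1 for deletion, 1.0 elsewhere) and the function name `weighted_cost` are our own assumptions for illustration.

```python
# Assumed weights: the paper states only "very low" for deletion.
WEIGHTS = {"ins": 1.0, "del": 0.1, "sub": 1.0, "shift": 1.0}

def weighted_cost(op_counts, weights=WEIGHTS):
    """Sum of edit-operation counts scaled by their weights."""
    return sum(weights[op] * n for op, n in op_counts.items())

# Test segment: "how much does it cost"
seg1 = {"del": 6}  # TM segment 1: delete "to the holiday inn by taxi"
seg2 = {"ins": 3}  # TM segment 2: insert "does it cost"

equal = {"ins": 1.0, "del": 1.0, "sub": 1.0, "shift": 1.0}
# With equal weights, segment 2 looks better (3 edits vs. 6) ...
assert weighted_cost(seg2, equal) < weighted_cost(seg1, equal)
# ... but with a low deletion weight, segment 1 wins, as argued above.
assert weighted_cost(seg1) < weighted_cost(seg2)
```

This small calculation reproduces the preference described in the text: discounting deletions flips the ranking in favor of the longer, deletion-heavy match.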
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments, and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is fed directly into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window
• Bengali: আমরা জানালার কাছে একটা টেবিল চাই
• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
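A minimal sketch of this projection step, assuming GIZA++-style `word ({ targets })` alignment lines as in the example above; the parsing helper and the set-based projection are our own illustration, not CATaLog's actual code.

```python
import re

def parse_giza_line(line):
    """Parse a GIZA++-style alignment line into
    {source_position: [target_positions]} (positions are 1-based; the
    leading NULL token occupies source position 0)."""
    align = {}
    for i, (_, targets) in enumerate(re.findall(r"(\S+) \(\{([\d ]*)\}\)", line)):
        align[i] = [int(t) for t in targets.split()]
    return align

def target_matches(matched_src_positions, align):
    """Project TER-matched source positions onto the target side:
    the returned target positions are the ones colored green."""
    green = set()
    for pos in matched_src_positions:
        green.update(align.get(pos, []))
    return green

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) "
        "table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
align = parse_giza_line(line)
# Source positions the TER alignment marked as matches:
# we(1), a(5), table(6), the(8), window(9), .(10)
green_targets = target_matches({1, 5, 6, 8, 9, 10}, align)
```

For this sentence pair, `green_targets` comes out as target positions {1, 2, 4, 5, 7}, i.e., the Bengali words aligned to the matched English tokens; the remaining target positions would be colored red.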
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation with more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source matches:

1. you gave me the wrong change i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me
Target matches:

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)
2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen.)
3. আপনি ভুল (Gloss: apni vul.)
4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din.)
5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen.)
For the input sentence shown above, the TM system shows these color coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in an input sentence in exactly the same form do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce it will be significantly different.
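The posting-list filter can be sketched as an ordinary inverted index. The function names and the tiny stop-word list below are our own assumptions; the paper does not specify the actual stop-word list used.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is"}  # illustrative only

def build_postings(tm_sources):
    """Map each lowercased, non-stop-word surface form to the ids of the
    TM segments containing it (no stemming, as described in the text)."""
    postings = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for token in segment.lower().split():
            if token not in STOP_WORDS:
                postings[token].add(sid)
    return postings

def candidate_segments(input_sentence, postings):
    """Union of the posting lists for the input's vocabulary words:
    only these segments undergo the expensive TER comparison."""
    candidates = set()
    for token in input_sentence.lower().split():
        candidates |= postings.get(token, set())
    return candidates
```

Building the index is done once, offline; at query time, each input sentence touches only the posting lists of its own words, so the TER comparison runs on a small candidate set instead of the whole TM.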
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to the different edit operations; we believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.
CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³

¹Jadavpur University, India
²Saarland University, Germany
³German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de
Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and to improve consistency in translation.
The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those of the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat[1] have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).
Input you gave me wrong number
Source Matches
1 you gave me the wrong change i paideighty dollars
2 i think you rsquove got the wrong number
3 you are wrong
4 you pay me
39
5 you rsquore overcharging me
Target Matches
1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )
2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )
3 আপিন ভল (Gloss apni vul )
4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )
5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )
For the input sentence shown above theTM system shows the above mentioned colorcoded 5 topmost TM matches in order of theirrelevance with respect to the post-editing ef-fort (as deemed by the TM) for producing thetranslation for the input sentence
33 Improving Search EfficiencyComparing every input sentence against allthe TM source segments makes the TM veryslow In practical scenario in order to getgood results from a TM the TM databaseshould be as large as possible In that case de-termining the TER alignments will take a lottime for all the reference sentences (ie sourceTM segments) For improving the search effi-ciency we make use of the concept of postinglists which is a de facto standard in informa-tion retrieval using inverted index
We create a (source) vocabulary list on thetraining TM data after removing stop wordsand other tokens which occur very frequentlyand have less importance in determining simi-larity All the words are then lowercased Un-like in information retrieval we do not performany stemming of the words as we want to storethe words in their surface form so that if theyappear in the same form as in some input sen-tence only then we will consider it as a matchFor each word in the vocabulary we maintaina posting list of sentences which contain thatword
We only consider those TM source sentencesfor similarity measurement which contain one
or more vocabulary word(s) of the input sen-tence This reduces the search space and thetime taken to produce the TM output TheCATaLog tool provides an option whether touse these postings lists or not This featureis there to compare results using and withoutusing postings lists In ideal scenario TM out-put for both should be the same though timetaken to produce the output will be signifi-cantly different
34 Batch TranslationThe tool also provides an option for translat-ing sentences in bulk mode Post-editors cangenerate TM output for an entire input file ata time using this option In this case the TMoutput is saved in a log file which the post-editors can directly work with later in offlinemode ie without using the TM tool
4 Conclusions and Future Work
This paper presents ongoing research and de-velopment of a new TM-based CAT tool andpost-editing interface entitled CATaLog Eventhough it describes work in progress we be-lieve some interesting new insights are dis-cussed and presented in this paper The toolwill be made available in the upcoming monthsas an open-source free software
We are currently working on different fea-tures to measure time and cognitive load in thepost-editing process The popular keystrokelogging is among them We would like to inves-tigate the impact of the innovations presentedhere in real world experimental settings withhuman translators
We are integrating and refining a coupleof features in the tool as for example sen-tence and clause segmentation using commaand semi-colon as good indicators It shouldalso be noted that in this paper we consideredonly word alignments In the future we wouldalso like to explore how multi-word expressions(MWE) and named entities (NE) can help inTM retrieval and post editing
We are also exploring contextual syntacticand semantic features which can be includedin similarity scores calculation to retrieve moreappropriate translations Another improve-ment we are currently working on concernsweight assignment to different edit operations
40
Figure 1 Screenshot with color cording scheme
We believe these weights can be used to opti-mize system performance
Finally another feature that we are invest-ing is named entity tagging Named entitylists and gazetteers can be used to identify andto translate named entities in the input textThis will help reduce the translation time andeffort for post-editors The last two improve-ments we mentioned are of course languagedependent
AcknowledgementsWe would like to thank the anonymousNLP4TM reviewers who provided us valuablefeedback to improve this paper as well as newideas for future work
English to Indian language Machine Trans-lation (EILMT) is a project funded by theDepartment of Information and Technology(DIT) Government of India
Santanu Pal is supported by the Peo-ple Programme (Marie Curie Actions) ofthe European Unionrsquos Framework Programme(FP72007-2013) under REA grant agreementno 317471
ReferencesErgun Biccedilici and Marc Dymetman 2008 Dy-
namic translation memory using statistical ma-chine translation to improve translation memoryfuzzy matches In Computational Linguisticsand Intelligent Text Processing pages 454ndash465Springer
Lynne Bowker 2005 Productivity vs Quality Apilot study on the impact of translation memorysystems Localisation Reader pages 133ndash140
Mauro Cettolo Christophe Servan NicolaBertoldi Marcello Federico Loic Barraultand Holger Schwenk 2013 Issues in in-cremental adaptation of statistical MT fromhuman post-edits In Proceedings of the Work-shop on Post-editing Technology and Practice(WPTP-2) Nice France
JP Clark 2002 System method and prod-uct for dynamically aligning translations in atranslation-memory system February 5 USPatent 6345244
Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)
Fabrizio Gotti Philippe Langlais ElliottMacklovitch Benoit Robichaud Didier Bouri-gault and Claude Coulombe 2005 3GTM AThird-Generation Translation Memory In 3rdComputational Linguistics in the North-East(CLiNE) Workshop Gatineau Queacutebec aug
Spence Green Jeffrey Heer and Christopher DManning 2013 The efficacy of human post-editing for language translation In Proceedingsof the SIGCHI Conference on Human Factorsin Computing Systems
Spence Green 2014 Mixed-initiative natural lan-guage translation PhD thesis Stanford Uni-versity
Ana Guerberof 2009 Productivity and qualityin the post-editing of outputs from translationmemories and machine translation LocalisationFocus 7(1)133ndash140
Ana Guerberof 2012 Productivity and Qualityin the Post-Edition of Outputs from TranslationMemories and Machine Translation PhD the-sis Rovira and Virgili University Tarragona
41
Rohit Gupta and Constantin Orăsan 2014 In-corporating Paraphrasing in Translation Mem-ory Matching and Retrieval In Proceedings ofEAMT
Rohit Gupta Constantin Orăsan MarcosZampieri Mihaela Vela and Josef van Gen-abith 2015 Can Translation Memories affordnot to use paraphrasing In Proceedings ofEAMT
Yifan He Yanjun Ma Josef van Genabith andAndy Way 2010 Bridging SMT and TM withtranslation recommendation In Proceedings ofACL pages 622ndash630
Panagiotis Kanavos and Dimitrios Kartsaklis2010 Integrating Machine Translation withTranslation Memory A Practical Approach InProceedings of the Second Joint EM+CNGLWorkshop Bringing MT to the User Researchon Integrating MT in the Translation Industry
Philipp Koehn and Jean Senellart 2010 Con-vergence of translation memory and statisticalmachine translation In Proceedings of AMTAWorkshop on MT Research and the TranslationIndustry pages 21ndash31
Elina Lagoudaki 2008 The value of machinetranslation for the professional translator InProceedings of the 8th Conference of the Associ-ation for Machine Translation in the Americaspages 262ndash269 Waikiki Hawaii
Percy Liang Ben Taskar and Dan Klein 2006Alignment by Agreement In Proceedings of theMain Conference on Human Language Technol-ogy Conference of the North American Chapterof the Association of Computational Linguisticspages 104ndash111
Sharon OrsquoBrien 2006 Eye-tracking and transla-tion memory matches Perspectives Studies inTranslatology 14185ndash204
Franz Josef Och and Hermann Ney 2003A systematic comparison of various statisticalalignment models Computational linguistics29(1)19ndash51
Mirko Plitt and Franccedilois Masselot 2010 A pro-ductivity test of statistical machine translationpost-editing in a typical localisation contextThe Prague Bulletin of Mathematical Linguis-tics 937ndash16
Michel Simard and Atsushi Fujita 2012 APoor Manrsquos Translation Memory Using MachineTranslation Evaluation Metrics In Proceedingsof the 10th Biennial Conference of the Associ-ation for Machine Translation in the Americas(AMTA) San Diego California USA
Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127
Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009
Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium
Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.
2 Related Work
CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.
Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is more suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).
Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.
Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.
3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets
3 BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.2.5) to find the edit rate between a test sentence and the TM reference sentences.
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.2.5 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.
5 http://www.cs.umd.edu/~snover/tercom/
Input: we would like a table by the window

TM Match: we want to have a table near the window
TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)
we   want   to      have   a   table   near   the   window
|    D      S       S      |   |       S      |     |
we   -      would   like   a   table   by     the   window
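The alignment above can be reproduced with a standard token-level edit-distance dynamic program. The sketch below is a simplification: it computes plain Levenshtein alignment only, whereas the actual tercom implementation additionally handles shift operations.

```python
def edit_alignment(hyp, ref):
    """Token-level Levenshtein alignment.
    Returns a list of (op, hyp_token, ref_token), op in {C, S, D, I}."""
    n, m = len(hyp), len(ref)
    # cost[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # trace back, preferring matches/substitutions over deletions/insertions
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append(("C" if hyp[i - 1] == ref[j - 1] else "S", hyp[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("D", hyp[i - 1], ""))  # drop a TM-match word
            i -= 1
        else:
            ops.append(("I", "", ref[j - 1]))  # insert an input word
            j -= 1
    return list(reversed(ops))

tm_match = "we want to have a table near the window".split()
input_seg = "we would like a table by the window".split()
for op, h, r in edit_alignment(tm_match, input_seg):
    print(op, h, r)
```

On this pair the program yields five matches, one deletion, and three substitutions, consistent with the alignment shown above.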
Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. In post-editing, however, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
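The effect of down-weighting deletions can be sketched as follows. The weight values are hypothetical; the paper only states that deletions receive a very low weight, not the exact figures used in CATaLog.

```python
# Hypothetical per-operation weights: deletion is cheap, the rest cost 1.0.
WEIGHTS = {"I": 1.0, "S": 1.0, "H": 1.0, "D": 0.1}  # H = shift

def weighted_edit_score(edit_counts, test_len):
    """TER-style score with per-operation weights; lower means more similar."""
    total = sum(WEIGHTS[op] * n for op, n in edit_counts.items())
    return total / max(test_len, 1)

test_len = 5  # "how much does it cost"
# Using TM segment 1 requires deleting 6 words; TM segment 2 requires inserting 3.
score_1 = weighted_edit_score({"D": 6}, test_len)  # 6 * 0.1 / 5 = 0.12
score_2 = weighted_edit_score({"I": 3}, test_len)  # 3 * 1.0 / 5 = 0.60
print(score_1, score_2)
```

With uniform weights the scores would be 1.2 and 0.6 and TM segment 2 would win; the cheap deletion flips the ranking, as argued above.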
3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
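Such a line can be parsed with a few lines of Python. This sketch assumes the standard GIZA++ `({ … })` bracket notation, which PDF extraction has partially stripped from the example above.

```python
import re

def parse_giza_line(line):
    """Parse one GIZA++ alignment line, e.g. 'NULL ({ }) we ({ 1 }) ...',
    into a list of (source_word, [aligned_target_positions])."""
    pairs = re.findall(r"(\S+)\s+\(\{\s*([\d\s]*)\}\)", line)
    return [(word, [int(p) for p in pos.split()]) for word, pos in pairs]

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) "
        "a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
for word, targets in parse_giza_line(line):
    print(word, targets)
```

Unaligned source words (such as "to" and "have" above) come back with an empty target-position list, and the NULL token collects target words that no source word generated.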
GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
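A minimal sketch of this projection step follows. The data structures and positions are hypothetical encodings of the example above; the tool's internal representation is not described in the paper.

```python
def color_target(matched_src, src2tgt, tgt_len):
    """Project source-side TER matches onto the target side.
    matched_src: 1-based source positions TER marked as matched (C).
    src2tgt: {source position: [target positions]} from the word aligner.
    Returns one 'green'/'red' label per target position."""
    green = set()
    for s in matched_src:
        green.update(src2tgt.get(s, []))
    return ["green" if t in green else "red" for t in range(1, tgt_len + 1)]

# "we want to have a table near the window ." vs the input sentence:
# TER marks positions 1, 5, 6, 8, 9, 10 (we, a, table, the, window, ".")
# as matched; "the" (position 8) has no target-side link in the alignment.
src2tgt = {1: [1], 2: [6], 5: [4], 6: [5], 7: [3], 9: [2], 10: [7]}
matched_src = {1, 5, 6, 8, 9, 10}
print(color_target(matched_src, src2tgt, 7))
```

Target words linked only to unmatched source words (here the translations of "want" and "near") come out red, which is exactly what tells the post-editor where to edit.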
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a near 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change. i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me
Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3. আপিন ভল (Gloss: apni vul)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system shows the five color-coded topmost TM matches listed above, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve the search efficiency, we make use of posting lists, a de facto standard in information retrieval based on inverted indexes.
We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to turn these posting lists on or off, which makes it possible to compare results with and without them. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
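The posting-list mechanism can be sketched as follows. This is an illustrative implementation: the stop-word list and the whitespace tokenization are simplified assumptions, not the tool's actual configuration.

```python
# Illustrative stop-word list; the actual list used by CATaLog is not specified.
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}

def build_posting_lists(tm_sources):
    """Map each lowercased non-stop-word surface form to the set of TM
    sentence ids containing it (no stemming, as described above)."""
    postings = {}
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings.setdefault(word, set()).add(sid)
    return postings

def candidates(postings, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["you gave me the wrong change", "i think you 've got the wrong number",
      "how much does it cost"]
postings = build_posting_lists(tm)
print(candidates(postings, "you gave me wrong number"))
```

Only the candidate set returned here is then scored with TER, so a brute-force scan over the whole TM would give the same output, just much more slowly.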
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at once using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.
Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.
Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331
Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)
42
Author Index
Baisa Viacutet 31Barbu Eduard 9
Chatzitheodoroou Konstantinos 24
Horak Ales 31
Medvedrsquo Marek 31Mitkov Ruslan 17
Naskar Sudip Kumar 36Nayek Tapas 36
Pal Santanu 36Parra Escartiacuten Carla 1
Raisa Timonera Katerina 17
van Genabith Josef 36Vela Mihaela 36
Zampieri Marcos 36
43
of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to see instantly how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.
3.1 Finding Similarity
To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) expressing how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insertion, deletion, substitution and shift. We use the TER metric (using tercom-7.25 [5]) to find the edit rate between a test sentence and the TM reference sentences.
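In its single-reference form, the edit rate just described can be written as follows (a standard formulation of TER; normalizing by the length of the first sentence is stated in the description above, and the notation here is ours):

```latex
\mathrm{TER}(s_1, s_2) \;=\; \frac{\#\mathrm{Ins} + \#\mathrm{Del} + \#\mathrm{Sub} + \#\mathrm{Shift}}{|s_1|}
```

where the numerator counts the minimal number of insertions, deletions, substitutions and block shifts needed to turn $s_2$ into $s_1$, and $|s_1|$ is the length (in words) of the first sentence.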
Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions, insertion, deletion and substitution, respectively.
[5] http://www.cs.umd.edu/~snover/tercom
Input: we would like a table by the window .

TM Match: we want to have a table near the window .
TER alignment: ["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

TM match:  we  want  to     have  a  table  near  the  window  .
Edits:     |   D     S      S     |  |      S     |   |       |
Input:     we  -     would  like  a  table  by    the  window  .
Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. In post-editing, however, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.
• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much
If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
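To make the weighted scoring concrete, here is a minimal sketch (not the actual CATaLog implementation; the weight values are illustrative placeholders, and TER's shift operation is omitted for brevity):

```python
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Weighted Levenshtein cost of turning a TM segment into the test
    segment. Deletion is deliberately cheap: deleting already-highlighted
    words during post-editing takes far less effort than inserting new ones.
    (Illustrative weights; the shift operation of TER is omitted.)"""
    n, m = len(test), len(tm)
    # d[i][j]: min cost to transform the first j TM words into the first i test words
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_ins                      # insert missing test words
    for j in range(1, m + 1):
        d[0][j] = j * w_del                      # delete surplus TM words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,   # insertion
                          d[i][j - 1] + w_del,   # deletion
                          d[i - 1][j - 1] + sub) # match or substitution
    return d[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# Cheap deletions make TM segment 1 (six words to delete, cost 0.6)
# preferable to TM segment 2 (three words to insert, cost 3.0).
assert weighted_edit_cost(test, tm1) < weighted_edit_cost(test, tm2)
```

With equal weights (e.g. `w_del=1.0`) the preference flips to TM segment 2, reproducing the plain TER ranking discussed above.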
3.2 Color Coding

From the top 5 choices, the post-editor selects one reference translation to post-edit. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.
The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.
• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
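The combination of the two alignment sets can be sketched as follows (an illustrative reconstruction rather than the tool's actual code; the data layout and the coloring rule shown here are assumptions):

```python
def color_target(ter_ops, src2tgt, tgt_len):
    """ter_ops: one TER label per TM source token ('C' = match, anything
    else = edit), from the input-vs-TM-source alignment.
    src2tgt: GIZA++-style word alignment {source index: [target indices]}
    (0-based here). A target word is colored green only if every source
    word aligned to it is a TER match; unaligned or edit-linked words
    stay red."""
    ops_per_target = [[] for _ in range(tgt_len)]
    for s, targets in src2tgt.items():
        for t in targets:
            ops_per_target[t].append(ter_ops[s])
    return ["green" if ops and all(op == "C" for op in ops) else "red"
            for ops in ops_per_target]

# TM source "we want to have a table near the window ." vs. the input
# "we would like a table by the window ." (TER labels from the example above)
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# 0-based version of the GIZA++ alignment shown above
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
colors = color_target(ter_ops, src2tgt, tgt_len=7)
```

Requiring *all* aligned source words to match is the conservative choice: a target word linked to any edited source word likely needs the post-editor's attention.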
Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of green to red. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green, yet they are not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates. Figure 1 presents a screenshot of CATaLog.
Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change . i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me
5. you 're overcharging me
Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen . ami ashi dollar diyechi .)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen .)

3. আপিন ভল (Gloss: apni vul .)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din .)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen .)
For the input sentence shown above, the TM system presents these five color-coded topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, computing the TER alignments against all the reference sentences (i.e., source TM segments) takes a long time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists; this feature exists to compare results with and without them. In the ideal scenario the TM output should be identical in both cases, though the time taken to produce it will differ significantly.
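The posting-list filtering can be sketched as follows (a minimal illustration under assumed data; the stop-token list is a hypothetical placeholder, and real sentence ids would come from the TM database):

```python
from collections import defaultdict

# Hypothetical stop/frequent-token list; the real list would be tuned to the TM.
STOP_TOKENS = {"the", "a", "an", "to", "of", "and", "is", ".", "?", "!"}

def build_posting_lists(tm_sentences):
    """Map each lowercased surface form (no stemming, so only exact surface
    matches count) to the set of TM sentence ids containing it."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sentences):
        for word in sentence.lower().split():
            if word not in STOP_TOKENS:
                postings[word].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Union of the posting lists of the input's vocabulary words: only these
    TM sentences are then scored with TER, shrinking the search space."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= postings.get(word, set())
    return candidates

tm = ["we want to have a table near the window .",
      "how much does it cost ?",
      "you pay me ."]
postings = build_posting_lists(tm)
candidates = candidate_sentences(postings, "we would like a table by the window .")
```

Disabling the filter (scoring every TM sentence) should yield the same output, as noted in the text, only much more slowly on a large TM.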
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file, which the post-editors can work with directly later in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining further features in the tool, for example sentence and clause segmentation using commas and semi-colons as indicators. It should also be noted that in this paper we considered only word alignments; in the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations; we believe these weights can be used to optimize system performance.

Figure 1: Screenshot with the color coding scheme.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and quality in the post-edition of outputs from translation memories and machine translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodoroou, Konstantinos, 24
Horak, Ales, 31
Medved', Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36
Zampieri Marcos 36
43
5. you're overcharging me
Target Matches
1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)
For the input sentence shown above, the TM system displays the five topmost TM matches, color coded as described earlier, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, the TM database should be as large as possible in order to obtain good results from the TM; in that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) takes a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on inverted indexes.
We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and contribute little to determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, since we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.
We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not, making it possible to compare results obtained with and without them. In the ideal scenario the TM output should be the same in both cases, although the time taken to produce it will differ significantly.
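The candidate-filtering step described above can be sketched as follows. This is a minimal illustration, not the actual CATaLog code; the stop-word list and tokenizer are placeholders:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}  # placeholder stop-word list

def build_posting_lists(tm_sources):
    """Map each vocabulary word (lowercased surface form, no stemming)
    to the set of TM sentence ids that contain it."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Return ids of TM source sentences sharing at least one vocabulary
    word with the input; only these candidates need to be scored
    (e.g., with TER), which shrinks the search space."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= postings.get(word, set())
    return candidates

tm = ["you gave me the wrong change",
      "i think you have the wrong number",
      "the weather is nice today"]
postings = build_posting_lists(tm)
print(sorted(candidate_sentences(postings, "you dialled a wrong number")))  # → [0, 1]
```

Sentence 2 shares no content word with the input, so it is never compared, which is exactly the saving the posting lists provide on a large TM database.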
3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at once. In this case the TM output is saved in a log file, which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.
4 Conclusions and Future Work
This paper presents ongoing research on, and the development of, a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.
We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.
We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and the semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.
We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations.
Figure 1: Screenshot with color coding scheme
We believe these weights can be used to optimize system performance.
Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements mentioned are, of course, language dependent.
Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.
English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.
Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.
References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).
Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horak, Ales 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36