Training Guide PE Certification


  • 8/17/2019 Training Guide PE Certification

    1/79

     

    Training_Guide_PE_Certification

    Revision Date: 30/10/2013

    SDL Certification

    Post-editing Certification


    Table of contents 

    1  Introduction
    1.1  About this training workbook

    2  A brief history of post-editing and MT
    2.1  What is MT?
    2.2  MT development in the last century
    2.3  A short history of MT at SDL

    3  Post-editing versus Translation
    3.1  Global developments and the localisation industry
    3.2  Why post-edit?
    3.3  Why translate?

    4  MT Technologies
    4.1  The challenges of MT
    4.2  Rules-based Machine Translation (RBMT)
    4.3  Statistical Machine Translation (SMT)
    4.4  Hybrid Systems

    5  How the MT output is created
    5.1  Baselines
    5.2  Verticals
    5.3  Customisations
    5.4  Engine training process

    6  From the MT output onwards: the basics of post-editing
    6.1  Introduction to post-editing
    6.2  Degrees of post-editing
    6.3  The quality check process

    7  How to get the most out of MT
    7.1  What makes an effective post-editor?
    7.2  Post-editing quality expectations
    7.3  Under-editing
    7.4  Over-editing
    7.5  Help improve MT for the future

    8  Expected Statistical MT behavior
    8.1  Common patterns to watch for when post-editing
    8.2  How to provide feedback to improve the MT output

    9  Using BeGlobal baselines in SDL Trados Studio
    9.1  BeGlobal baselines
    9.2  How to add SDL BeGlobal Community as a translation provider in SDL Trados Studio

    10  Summary
    10.1  Conclusion to training workbook

    11  Further references
    11.1  More information on MT and post-editing

    12  Appendix
    12.1  Post-editing examples


    1  Introduction

    1.1  About this training workbook

    The scope of this training workbook is to introduce the reader to the techniques and

    skills involved in post-editing machine translation (MT) output. It provides practical

    examples of best-practice post-editing and recurrent issues such as over-editing and

    under-editing. Moreover, it aims to familiarise translators with MT technology in order to

    enable their involvement in the entire process from training engines to post-editing

    content to publishable quality.

    The document covers the following areas:

    • The history and development of MT
    • The various MT technologies currently used and the effects they have on the
      quality and post-editability of the MT output
    • The post-editing and quality check processes and their relation to conventional
      human translation
    • A guide to effectively post-editing MT output to understandable and publishable
      quality
    • Common patterns to watch for when post-editing MT output
    • Using BeGlobal baselines in Studio
    • Where to find further information on MT and post-editing processes

    In addition, the document aims to address some of the common misconceptions about
    MT:

    • MT is taking away my job
    • MT output is always low quality
    • MT material is only useful when it can be easily edited
    • MT does not leave any room for creativity
    • MT does not fit with my translation style
    • MT technology is too complicated
    • Post-editing is less skilled than translation


    2  A brief history of post-editing and MT

    2.1  What is MT?

    Machine Translation (MT) is automated translation that uses software to translate text

    from one natural language to another. It is one of the oldest applications of Artificial

    Intelligence and both facilitates and accelerates the creation of high quality translations.

    Post-editing MT output can increase productivity in comparison with conventional

    translation. It allows companies to deliver a high quality translation at greater speed,

    and consequently at lower cost, and as such can be considered a new industry “trend”.

    However, it is important to remember that MT does not replace human translators. MT

    is a tool rather than an end solution and a stage of human correction will always be

    necessary when post-editing to a publishable quality. Nonetheless, it is an effective tool

    when understood and used correctly.

    Uses of Machine Translation

    Fully Automated Useful Translation (FAUT)

    • MT is generated by baseline engines or customised engines and the output is used
      directly, with no human intervention.
    • This solution is used mostly for content such as emails, support content or instant
      messages, where the user wants to have an idea of the content, without the need for
      high quality.

    Post-editing

    • MT output from customised MT engines is post-edited by linguists to a quality level
      equivalent to conventional translation.
    • Post-editing MT content is the preferred solution for publishable documents. It is
      used as part of a high quality translation process.

    2.2  MT development in the last century

    Following on from the efforts of war-time cryptography, MT is generally considered to
    have started in the 1950s. In 1954, the successful execution of the Georgetown
    Experiment - the fully automated translation of approximately sixty Russian sentences
    into English - ushered in an era of significant funding for MT research in the USA.
    Researchers believed they could produce a fully automated MT system within three to
    five years. This endeavour proved more difficult than expected, however, and ten years
    later funding was cut when it became clear that the development of MT had not
    progressed as far as originally hoped.

    Early attempts at MT typically failed because of a lack of coverage. The models

    functioned by encoding a limited selection of transformational rules that simply did not

    provide for the diversity of natural language translation. Consequently, the first attempts

    in the 1970s and 1980s to commercialise MT operated by drastically increasing the

    number of encoded transformational rules. This produced Rules-Based Machine

    Translation (RBMT), which functioned relatively successfully with targeted human

    feedback over a particular domain. However, this led to the further problem of how to

    make the abundance of transformational rules needed to encode language pairs co-

    operate with one another. The answer was a statistical approach to MT.

    In the late 1980s, computational power increased and became less expensive and as a

    result interest picked up in Statistical Machine Translation (SMT). From the 1990s,

    statistical learning approaches came to the fore, led by cutting-edge work from the IBM

    research team. SMT systems no longer required the same human effort to encode

    transformational rules and update lexicons and terminology lists, but rather exploited

    the wealth of existing translations, covering numerous language pairs, to extract rules

    based on statistical probability.

    Since the 1990s, SMT has been pushed forward through intensive research and

    training as well as support from industry, the Defense Advanced Research Projects Agency

    (DARPA) and EC FP7. Statistical MT has been deployed in real-world, commercial

    contexts by Language Weaver (now part of SDL), Google, Microsoft and IBM, and

    there is on-going research into hybrid phrase-based and syntax-based MT. In 2011,

    SMT was boosted with Google's announcement that it would charge for access to the

    Google Translate API. Shortly afterwards, Microsoft also announced that it would start

    charging for use of the Microsoft Translator API. These two events can be viewed as a key milestone for the Machine Translation Industry and the Localisation Industry as a


    whole. The progression to a paid API model for machine translation is a clear sign that

    both the use and the quality of MT have matured to a level where enterprises and

    developers see sufficient value in MT to invest in it.

     After many decades, it appears that the models used in MT are more in line with our

    understanding of how human language cognition and processing operates. This does

    not mean that MT output is of an equal standard to output produced by the human

    brain. However, we now understand more about what MT can contribute to the

    Localisation Industry and have an invaluable tool for translation that is becoming ever

    more prominent in the field.

    MT accuracy is improving every year and many new techniques are being developed

    and deployed as the field becomes more and more interdisciplinary, drawing from

    computer science, linguistics, probability theory, algorithm design, automata theory and

    engineering.


    Some facts about MT today

    • Of the top 50 global companies, 53% publicly acknowledge that they use an MT solution
    • 54% of non-Anglophones use MT when visiting English language websites
    • 75% of people use free MT tools
    • It is estimated that at least three-quarters of web users take advantage of free
      translation tools due to the greater accessibility and integration of MT solutions.

    2.3  A short history of MT at SDL

    SDL first adopted MT into Translation Services in the year 2000 after acquiring a
    Rules-Based Machine Translation (RBMT) engine from Transparent Language, which
    became SDL Enterprise Translation Server (ETS). In 2004, the Knowledge-based
    Translation System (KbTS) Group was set up to use ETS in a high quality translation
    process.

    In 2009, Statistical Machine Translation (SMT) was beginning to establish itself firmly in
    the localisation industry following rapid development. SDL forged a strategic
    partnership with leading SMT developer, Language Weaver, allowing SDL to extend
    the languages supported by MT.

    In 2010, SDL acquired Language Weaver and are continuing to invest heavily in the
    development of SMT technology. SDL rolled out this capability to their Production
    Offices which resulted in a huge increase in scalability and allowed the process to grow
    rapidly. KbTS was re-branded in 2011 to iMT (intelligent Machine Translation) and the
    first post-editing projects were rolled out using SDL Language Weaver SMT.

    Today the SDL iMT department consists of an in-house team of language specialists,

    MT scientists and project managers, supplemented by trained teams in the Production

    Offices plus a large fully-trained freelance post-editing team. The iMT team are

    responsible for the maintenance of MT engines and for all MT evaluations and

    customisations within SDL Global Solutions. The Project Management team manages

    the set-up of projects, and plans and schedules the customisations. The linguists are

    responsible for evaluating the project data for MT suitability based on the content to be

    translated. Once the project is approved for MT, the linguists prepare the data, test the

    results and organise training for the linguistic team in the Production Office as well as

    the freelancers who will work on the project. This approach of preparation, testing and

    training helps guarantee a high quality MT engine and therefore a high quality final

    translation.

     And as for the future, developments within MT are made through improved models and

    algorithms as well as by adding more high quality training data. SDL is constantly

    working on improvements to the machine translation technology so that even better MT

    engines can be created going forwards. The future for MT at SDL is full of possibilities

    and iMT will be on-hand to offer its many years of expertise as the range of MT

    solutions increases.


    Brief Timeline of MT at SDL

    2000  Acquisition of Rules-Based Machine Translation (RBMT) engine from Transparent
          Language: SDL Enterprise Translation Server (ETS)
    2004  Knowledge-based Translation System (KbTS) Group set up to use ETS in a high
          quality translation process
    2009  Partnership with Language Weaver (LW)
    2009  Training from LW on how to customise SMT engines
    2010  Rollout of post-editing process to Production Offices
    2010  Due diligence and acquisition of Language Weaver
    2011  Re-branding of KbTS to iMT
    2011  First iMT projects using SMT
    2013  Continued development of SMT within SDL


    3  Post-editing versus Translation

    3.1  Global developments and the localisation industry

     An increasing number of companies are entering the international market and are

    publishing localised materials in a bid to reach more customers and realise greater

    sales opportunities. This is based on the finding that 85% of consumers feel that having

    pre-purchase information in their own language is a critical factor in buying services.

    IBM estimates that 2.5 quintillion (10^18) bytes of data are created every day and that

    90% of corporate data originated in just the last three years. On average, companies

    translate this content into 11 languages. At the same time, strong competition and the

    need for faster turnaround times means that there is an immediate need to lower costs

    and achieve savings through efficient and streamlined technology processes.

    Key trends impacting on global businesses

    • Business globalisation
    • Internet use of multiple devices
    • Explosive growth in digital content
    • Effective targeting and revenue capture
    • Growth of translated content
    • Multimedia and video
    • Extreme brand management across all channels
    • Social media and community

    Many of the recent trends affecting global business and information management will
    have important consequences for the field of translation in the coming years. By the
    end of 2014 there will be 2 billion users of computers and most of the growth forecast is
    in the upcoming markets. This means that there will be more customers for software
    and appliances and consequently a larger need for translations of user interfaces and
    manuals.

    In addition, by 2014, there will also be 2.5 billion users of the internet, which is 36% of

    the world’s population, compared with 22% in 2010. Information equivalent to 10 billion

    DVDs will be sent over the internet each month. Not everyone will be able to access the

    information in the language of origin and consequently there will be a larger demand for

    translations in order to make information as widely accessible as possible.

    Furthermore, Cloud Computing has also begun to make an impact in the technologies

    industry. The use of the cloud is growing, and more and more users will need translations of the materials and content. The user interfaces will also require

    translation as the number of end-users with different language requirements grows.

    Thus, the demand for translation of both content and the interface itself is steadily

    increasing.

    Finally, social networking tools are rapidly increasing in popularity. The content lacks

    specific structure and often involves interaction between users in various languages.

    Companies are increasingly adopting social networking and professional use will

    ultimately mean that more translations are needed and in a shorter time  – in fact, often

    in real-time, as and when content is created. Again, this will result in a greater need for

    translation.

    In all of the above, the importance of English as a global lingua franca is slowly

    decreasing. Between 2000 and 2010, the two languages with the greatest growth on

    the internet were Arabic and Mandarin Chinese  –  both of which grew twentyfold. In

    contrast, content in English ‘only’ tripled. Proportionally, then, English is declining in

    importance relatively quickly. It is estimated that by 2020 English will have lost its status

    as a lingua franca altogether. However, rather than being replaced with another natural

    language, linguistic diversity will be the new status quo and translation will be key to

    communication. In summary, then, there will be an increasing demand for more content

    at greater speed and in an increasing number of languages.

    So the question is, how can MT and post-editing help respond to these trends?


    3.2  Why post-edit?

    In the last few years, there have been significant developments in MT technology. SDL

    has always been up to date with this development, and uses MT mainly to increase

    efficiency whilst still delivering quality. This is achieved through integration of the MT

    engines with SDL’s translation environments  –  SDL Trados Studio, TMS and

    WorldServer  –  which results in a streamlined process, leading to faster turnarounds

    and higher cost-effectiveness.

     A growing number of SDL’s customers and freelance translators now rely on MT for a

    high-quality, integrated translation process. Customised machine translation engines deliver output of such good quality that post-editing is faster than translating from

    scratch. Indeed, MT solutions can reduce production times by as much as 50% in some

    cases. As such, many clients consider MT the only viable way to process the enormous

    volume of content they need to localise. Moreover, in certain cases, it allows the client

    to consider translating content that they would not otherwise have tackled as the cost

    would have been prohibitive.

    However, post-editing is not only of value to the client but also has many advantages

    for the translator. SDL’s intelligent Machine Translation will help freelance translators to

    remain competitive and save time. We combine our SMT technology with project-

    specific Translation Memories to produce translations of post-editable quality that can

    help to increase productivity. Post-editing is not inferior to conventional translation but

    requires all the usual translation skills  –  such as domain knowledge, excellent

    command of the source and target language, proficiency with CAT tools  –  plus a

    willingness to embrace new technological advances.

    The demand for MT solutions is growing quickly and post-editing is rapidly becoming a

    basic skill for translators. Learning how to post-edit will give linguists a foothold in an

    evolving market and open up new freelance possibilities. We have seen a real swing in

    attitudes in the last few years with many clients looking to MT as the default option to

    help deliver translation faster and cheaper – without sacrificing quality.

    In summary, the following client and translator benefits apply:

    Client benefits

    • Lower cost
    • Faster time to market
    • Publishable quality
    • Higher volumes for translation
    • Ability to handle digital content explosion

    Translator benefits

    • A valuable new skill that opens more opportunities
    • Competitive edge in an evolving market
    • Greater speed and efficiency
    • Higher volumes compensate for lower post-editing rates

    3.3  Why translate?

    Whilst post-editing can provide a number of benefits for clients and translators alike, not
    all projects will be suitable for post-editing. Because MT typically reproduces the
    material used to train the engine, previously unseen material can present difficulties.
    This is particularly common in text types with highly complex sentence structures or
    very specific terminology and texts with a high degree of ambiguity which require
    translations to move away from the source.

    At SDL, all content is evaluated carefully before a project or part of a project is
    considered for MT. Machine Translation technology is improving all the time and
    content types that were not suitable two years ago are now handled very productively
    using Machine Translation. In some cases, however, conventional translation will still
    be the recommended solution for the foreseeable future.


    4  MT Technologies

    4.1  The challenges of MT

    MT shares many of the challenges of human language translation. These include the

    ambiguity and polysemy of natural human language as well as the high levels of

    linguistic diversity between languages. Particularly, where there is variation in the

    morphological or syntactic characteristics of a language it becomes much harder for MT

    to match the source and target phrases. Given that no linguistic information is encoded

    into the statistical model this often presents problems.

    Some of the main issues and active research problems for MT (as well as conventional

    translation) are summarised below:


    The challenges of MT

    • Domain and genre: vocabulary, style (including active vs. passive) and sentence
      length will vary accordingly.
    • Ambiguity: human language is ambiguous on both lexical and syntactic levels
      • E.g. "bank" can be the financial institution or the edge of a river
      • E.g. "I saw the man with the telescope" - Is it the man or the speaker who is
        holding the telescope?
    • Variation in morphology and word order
      • E.g. case and definiteness endings in Hungarian and Swedish
      • E.g. Verb - Subject - Object order in Arabic and Hebrew
    • No one-to-one translation: a word that covers many social, cultural and linguistic
      meanings in one language may require finer distinctions in another language and
      vice versa
      • E.g. politeness levels in Japanese
      • E.g. German "Tasse" = English "mug" or English "cup"
    • Idioms: difficult to translate like any other form of formulaic language
      • E.g. French "Avoir les dents longues" = English "To be ambitious" (Lit: "To have
        long teeth")
    • Language specific characteristics
      • E.g. Arabic tokenisation, Chinese word segmentation, etc.


    4.2  Rules-based Machine Translation (RBMT)

    Chronologically speaking, Rules-Based Machine Translation (RBMT) was the first

    approach to automated translation. It involves parsing a source sentence, analysing the

    structure, converting this to a machine-readable code and then transforming it into the

    target.

    The core system is based on a set of grammatical rules for each of the languages,

    combined with a dictionary. The dictionary contains source words and phrases, their

    translations and detailed grammatical information, such as part of speech and

    inflection. It provides the modules with the linguistic knowledge they need.

    The rules are the “linguistic processor” of the system, responsible for analysis  and

    generation. They use linguistic information stored in the dictionary. These rules are

    intended to represent the grammatical knowledge of speakers and specify inherent

    agreement and relational information.

     At the translation stage, the MT engine analyses each source sentence and tags the

    words and phrases with their part of speech to identify the grammatical components, for

    example, the subject, object and verb. The MT system then looks up the translations of

    these grammatically tagged words and phrases in the machine dictionary and

    combines them using the coded language rules for the target language. This builds the

    translated sentence.

     A large core dictionary provides the translations for everyday words and phrases. For

    translations that use special terminology, an RBMT system can use custom dictionaries in conjunction with the baseline to improve translation accuracy.

    Example

    Determiner and noun need to agree in number and gender

    Subject and finite verb need to agree in number
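
    The analyse, look up and combine steps described above can be sketched in a few lines of
    Python. This is only a toy illustration under stated assumptions, not how SDL ETS or any
    real RBMT engine is implemented: the three-word dictionary, the names Entry, DICTIONARY,
    ARTICLE_FORMS, analyse and translate_noun_phrase, and the "final -s means plural" rule
    are all hypothetical. It shows a single agreement rule for English into French: the
    determiner must agree with the noun in number and gender.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    target: str                   # target-language lemma
    pos: str                      # part of speech: "DET" or "NOUN"
    gender: Optional[str] = None  # gender of the target noun: "m" or "f"


# Hypothetical miniature English-to-French dictionary (illustration only).
DICTIONARY = {
    "the": Entry("le", "DET"),
    "book": Entry("livre", "NOUN", "m"),
    "house": Entry("maison", "NOUN", "f"),
}

# Agreement rule: the French definite article depends on the noun's gender and number.
ARTICLE_FORMS = {
    ("m", "sg"): "le",
    ("f", "sg"): "la",
    ("m", "pl"): "les",
    ("f", "pl"): "les",
}


def analyse(word):
    """Look a source word up and return (entry, number); a final -s is read as plural."""
    if word in DICTIONARY:
        return DICTIONARY[word], "sg"
    if word.endswith("s") and word[:-1] in DICTIONARY:
        return DICTIONARY[word[:-1]], "pl"
    raise KeyError(f"'{word}' is not in the toy dictionary")


def translate_noun_phrase(source):
    """Translate a determiner + noun phrase, enforcing determiner-noun agreement."""
    analysed = [analyse(word) for word in source.lower().split()]
    # The noun decides the gender and number that the determiner must copy.
    noun, number = next((e, n) for e, n in analysed if e.pos == "NOUN")
    output = []
    for entry, _ in analysed:
        if entry.pos == "DET":
            output.append(ARTICLE_FORMS[(noun.gender, number)])
        else:
            output.append(entry.target + ("s" if number == "pl" else ""))
    return " ".join(output)


print(translate_noun_phrase("the house"))  # la maison
print(translate_noun_phrase("the books"))  # les livres
```

    Even this tiny sketch hints at why rule sets grow quickly: it does not handle elision
    (l'), irregular plurals or any word missing from the dictionary, and each of those would
    need a further rule or dictionary entry.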


    How to recognise RBMT output

    The RBMT output is based on 3 factors:

    • Rules for the language pair
    • General settings that can be customized (such as quotation marks, verb tense,
      accents, decimal point)
    • The project dictionary, where the specific terminology is entered and which is key
      to improving the MT quality.

    Some common issues can be identified when post-editing rules based machine

    translation. Here we include some examples from English into French, Italian, Spanish,

    Portuguese, Dutch, German, Swedish, and Finnish, which are the most common

    languages for RBMT.

    In order to recognise MT error patterns, post-editors should look out for the following

    potential issues when post-editing.

    Use of superfluous articles

    Superfluous articles are commonly added in most languages. They can also occur

    before proper nouns.

    EN Source: Free High Speed Internet Access!

    IT MT output: l’Accesso gratuito a internet ad alta velocità!

    IT Post-edited: Accesso gratuito a internet ad alta velocità!

    EN Source: Oil filter unit: Removal - Refitting

    FR MT output: Bloc filtre à huile : La dépose - la Repose

    FR Post-edited: Bloc filtre à huile : Dépose - Repose

    Use of simple prepositions


    When a term has not been entered in the Customised Dictionary, simple prepositions

    are used and they should be corrected when needed.

    EN Source: Reconnect ECT sensor electrical connector.

    FR MT output: Reconnecter le connecteur électrique de capteur ECT 

    FR Post-edited: Reconnecter le connecteur électrique du capteur ECT

    Acronyms automatically translated into terms

    When a specific acronym has not been entered in the Customised Dictionary it is

    automatically and consistently translated into a common term which exists in the Core

    Dictionary.

    EN Source: MR

    IT MT output: Sig.

    DE MT output: Herr

    FR MT output: M.
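
    The fallback behaviour behind this pattern can be pictured as a two-level lookup. The
    sketch below is an assumption made for illustration, not the actual dictionary logic of
    any RBMT product: CORE_DICTIONARY and lookup are hypothetical names, and the custom entry
    that keeps "MR" untranslated is an invented project requirement.

```python
# Hypothetical core dictionary entries, mirroring the outputs shown above.
CORE_DICTIONARY = {
    "it": {"MR": "Sig."},
    "de": {"MR": "Herr"},
    "fr": {"MR": "M."},
}


def lookup(term, target_language, custom_dictionary):
    """Return the custom translation if one exists, otherwise fall back to the core dictionary."""
    if term in custom_dictionary:
        return custom_dictionary[term]
    # No custom entry: the core dictionary's everyday reading is used every time.
    return CORE_DICTIONARY.get(target_language, {}).get(term, term)


# Without a custom entry the acronym is read as the title "Mr".
print(lookup("MR", "it", {}))            # Sig.
# With a custom entry (here: keep the acronym as-is) the project term wins.
print(lookup("MR", "it", {"MR": "MR"}))  # MR
```

    Because the fallback is deterministic, the same wrong translation appears consistently,
    which is why a single dictionary entry is enough to fix every occurrence.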

    Proper nouns translated literally

    EN Source: Thanks to Peter Ferry for reporting the VBScript/JScript Buffer Overrun Vulnerability.

    IT MT Output: Grazie al Traghetto di peter per segnalare la Vulnerabilità legata al

    sovraccarico del buffer di VBScript JScript.

    IT Post-edited: Grazie a Peter Ferry per aver segnalato la vulnerabilità legata al sovraccarico del buffer di VBScript JScript.

    EN Source: He lives in Palm Springs.

    FR MT output: Il habite à Printemps de Paume.

    FR Post-edited: Il habite à Palm Springs.
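
    A post-editor could spot this pattern with a simple automated check. The sketch below is
    a hypothetical helper, not a feature of SDL Trados Studio or any other tool: the function
    name missing_proper_nouns and its heuristic (capitalised source words, other than the
    first word, are expected to reappear unchanged in the target) are assumptions, and the
    heuristic will produce false positives wherever a capitalised word should be translated.

```python
import re


def missing_proper_nouns(source, mt_output):
    """List capitalised source words (after the first word) that are absent from the MT output."""
    tokens = re.findall(r"[A-Za-z][\w/-]*", source)
    # Skip the first token: its capital letter only marks the start of the sentence.
    candidates = [token for token in tokens[1:] if token[0].isupper()]
    return [token for token in candidates if token not in mt_output]


source = "He lives in Palm Springs."
print(missing_proper_nouns(source, "Il habite à Printemps de Paume."))  # ['Palm', 'Springs']
print(missing_proper_nouns(source, "Il habite à Palm Springs."))        # []
```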


    Capitalisation issues

    The MT follows the source capitalisation unless specific terms have been entered in

    the Customised Dictionary with the required capitalisation. This is a particular problem

    in IT texts, e.g. for UI options.

    EN Source: Click Add Custom Phone Tune.

    FR MT output: Cliquez sur Ajoutez l'Air Personnalisé de Téléphone.

    FR Post-edited: Cliquez sur Ajouter une mélodie de téléphone personnalisée.

    EN Source: Select the appropriate option in the Automatic Synchronization section

    PT-BR MT output: Selecione a opção apropriada na seção Sincronização Automática 

    PT-BR Post-edited: Selecione a opção apropriada na seção Sincronizaçãoautomática 

    Disambiguation of homographs

    You may encounter problems with what we call “homograph resolution”. A homograph is a source term that can be translated as a noun and as a verb (or an adjective, etc.), for example NETWORK (a network, to network/networking).

    When a homograph is resolved incorrectly, the syntax of the entire sentence is misanalysed.

    In the following examples the nouns are interpreted as verbs:

    EN Source: Check box D6 on the blue label

    DE MT output: Kasten D6 auf dem blauen Aufkleber prüfen 

    DE Post-edited: Kontrollkästchen D6 auf dem blauen Aufkleber  

    EN Source: The water reservoir does not contain enough water.

    PT MT output: O reservatório de água não contém suficiente aguar .

    PT Post-edited: O reservatório de água não contém água suficiente.

    Compound formation and hyphenation issues

    For some languages, such as German and Finnish, compounding rules may work. If they do not work, the post-editor must amend the output accordingly and the term should then be encoded.

    RBMT – Pros and Cons

    RBMT allows for excellent terminology control. There is no need for pre-existing TMs, as project dictionaries can be created from scratch, and the output is systematic, rightly or wrongly, meaning that experienced post-editors learn over time to post-edit it quickly and reliably. However, it can take a number of years to develop a new language pair, and the source must be well written to generate good output. Moreover, project dictionaries are time-consuming to create and therefore expensive to maintain, and the output is often not very fluent and not sensitive to context, providing a single translation per term.

    Pros

    • A lot of control of rules and terminology
    • Once the grammar is established, new projects can be created from scratch relatively quickly
    • Once set up, projects are easy to maintain
    • Consistent use of terminology

    Cons

    • The grammar is very time-consuming to develop
    • Rather literal translations
    • Not context-sensitive

    4.3  Statistical Machine Translation (SMT)

    A Statistical Machine Translation (SMT) system learns to translate by analysing large volumes of previously translated content. The starting point for training an engine is an aligned corpus of source and translated sentences of hundreds of millions of words. The training process subdivides each of the source sentences into words and series of words (n-grams) and analyses the associated translated sentences. In this way the training process determines for each n-gram in the source the most likely set of translations. By analysing just the translated content, the training process learns the order in which the translated words are most likely to occur. The more training data there is, and the more consistent that data is, the more accurate the process becomes.
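    To make the idea of n-grams concrete, the short Python sketch below splits a sentence into all of its word sequences up to a chosen length. It is only an illustration: the function name `ngrams` and the `max_n` parameter are our own, and a real training process works on a tokenised corpus rather than on a single whitespace-split sentence.

```python
from typing import List, Tuple


def ngrams(sentence: str, max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return every word n-gram of length 1..max_n found in the sentence."""
    words = sentence.split()
    result = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the sentence.
        for start in range(len(words) - n + 1):
            result.append(tuple(words[start:start + n]))
    return result


if __name__ == "__main__":
    for gram in ngrams("Reconnect ECT sensor electrical connector", max_n=2):
        print(" ".join(gram))
```

    With `max_n=2` the sketch lists the five single words of the sentence followed by its four bigrams.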

    In the next stage of the process, the system compiles all of the learned data into the

    runtime MT engine. The runtime MT engine subdivides each sentence into smaller

    chunks and looks up the possible translations in the compiled database. For a given

    source sentence this process results in many possible translated sentences. The MT

    engine uses the statistical data on the probability of a translation and the word order to

    determine the best candidate for the MT output.
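    The selection step can be pictured with a toy example in Python. This is a sketch under strong assumptions, not the way any real engine is implemented: each candidate carries two invented probabilities (one for the translation choices, one for the word order), and the candidate with the highest combined score wins. The function name `best_candidate` and all the numbers are hypothetical.

```python
import math
from typing import Dict, Tuple

# candidate translation -> (translation probability, word-order probability)
# All probabilities used with this type are invented for illustration.
Candidates = Dict[str, Tuple[float, float]]


def best_candidate(candidates: Candidates) -> str:
    """Return the candidate with the highest combined log-probability."""
    def score(candidate: str) -> float:
        p_translation, p_order = candidates[candidate]
        # Adding logs is equivalent to multiplying the two probabilities.
        return math.log(p_translation) + math.log(p_order)

    return max(candidates, key=score)


if __name__ == "__main__":
    options = {
        "Il habite à Printemps de Paume.": (0.30, 0.02),
        "Il habite à Palm Springs.": (0.25, 0.20),
    }
    print(best_candidate(options))
```

    In this made-up case the second candidate is selected: its word-order probability is much higher, which outweighs its slightly lower translation probability.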

    For general purpose translations, the system uses a baseline language engine that is

    trained with a large corpus of broad spectrum content  – hundreds of millions of words.

    To enhance performance for applications that use specific terminology, an SMT system

    can be trained with a corpus that contains only or mostly content that is close to the

    data that is to be translated. An ideal corpus for this is a large Translation Memory (TM)

    that contains the previous translations of a project. The recommended volume of data

    required is 1 to 5 million words, although it is possible to work with less than 1 million.

    This is known as customisation or training.

    The quality of the MT output depends on both the linguistic and technical quality of the

    material included. However, compared to RBMT, SMT provides a more fluent translation

    with some context-sensitivity and better reflects the style of the training material.

    SMT – Pros and Cons

    Pros

    • Customisation times are quicker than with RBMT
    • Output reads more fluently and is stylistically better than the output from a rules-based system
    • Able to select the correct translation in certain contexts, e.g. “device” in the IT domain
    • Generally shorter setup times

    Cons

    • Need for large bilingual corpora (millions of words)
    • Difficult to maintain (retraining needs a large amount of content, which takes time to gather)
    • Need for processing time: file processing times are higher, with an impact on hardware costs

    Compared with RBMT, Statistical Machine Translation can offer a larger number of languages for post-editing, as engines are cheaper and faster to train, as well as easier to maintain. Moreover, because SMT is trained with “real” sentences and phrases, the direct output can be more fluent than with RBMT, which is good for raw output requirements and additionally helps the post-editor. In addition, there is a high level of research activity surrounding SMT, and performance improvement is predicted for the future. For this reason, SMT is the technology of choice at SDL.

    However, it should nonetheless be noted that SMT requires large amounts of memory and processing capacity, though this becomes less of a problem as technology develops. Moreover, the output depends on the quality and volume of data used for the customisation, and therefore the post-editor must be aware of the range of common trends in order to post-edit accurately. Similarly, it is harder than with RBMT to implement changes in terminology made by the client, and a project-specific engine can only be created if there is sufficient data as a starting point.

    Syntax-based SMT – pros and cons

    Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words. A syntax-based statistical engine can improve grammatical accuracy and ensure that verbs are realised in the correct position.

    Pros

    • Better modelling of target language structure
    • Ensures there is always a verb present
    • Realises the verb in the correct position
    • Better handling of function words, such as prepositions
    • Has a more powerful decoding algorithm

    Cons

    • Early stages of development
    • Sometimes less accurate terminology as no link to baseline

    The following table summarises the key differences between SMT and RBMT:

    Attribute SMT RBMT
    Does not need a large volume of aligned data for training/customisation  +
    Number of languages supported  +
    Setup time for new language  +
    Terminology control  +
    Software UI term handling  +
    Raw fluency  +
    Raw accuracy  +
    Level of research activity and performance improvement predicted  +

    4.4  Hybrid Systems

    One thing that is being explored in contemporary research into MT technology is the possibility of creating a hybrid engine, where dictionaries, rules and statistical features are combined so as to obtain the best of both worlds. This can be done in many different ways; examples are the use of a dictionary to enforce certain translations in SMT and the use of statistical techniques to determine the best translation for a homograph such as “bank” or “get”, where the translation is different depending on the context.

    However, current solutions are fairly pragmatic and leave room for further development

    in future. In some cases, hybrid systems do not back up to a baseline and this can

    exacerbate common MT issues, such as terminology inconsistencies and/or content left

    untranslated.

    5  How the MT output is created

    Statistical MT is now the technology of choice at SDL, so the rest of this course concentrates on SMT technology.

    SDL takes a three-pronged approach to SMT and uses the following engine types, matching the solution to the particular use case:

    • Customised engines: engines trained on a specific client corpus
    • Verticals: domain-specific engines
    • Baselines: generic engines containing diverse data

    5.1  Baselines

    The core MT engines developed by SDL are known as baselines. These baseline systems are bilingual corpora used as general databases for each language pair. They are based on a large translated corpus of hundreds of millions of words, taken from reliable sources available in the public domain, such as news, IT documentation, technical manuals and publicly available government material, and distributed across various domains, including IT, automotive, news, sports, electronics, etc.

    Baselines are under constant development and new releases are launched frequently.

    This solution produces good results for clients who require immediate access to MT,

    who do not have sufficient volumes of data and/or wish to translate general content

    across several domains.

    Client-specific customisations and domain-specific verticals normally use baseline

    engines as a backup; so if a certain word, phrase, or even grammatical structure is not

    present in the training data, the engine may still be able to produce a translation.

    Baselines  – Pros and Cons 

    5.2  Verticals

    A vertical is a trained statistical engine that exclusively contains data related to a specific subject area, or domain, such as IT, Automotive, Electronics etc. When a client does not have enough translated data for client-specific training, a vertical solution can be used instead of a customisation on top of the baseline corpus.

    These domain-specific engines therefore provide a point of entry for projects that have small TMs. They also prove useful in those cases where there is not enough time to create a project-specific engine before the first jobs start to flow in. Because the vertical is a ready-to-use solution, it does not involve the development effort of creating client-specific engines.

    Because a Vertical uses a higher volume of data than a customisation, the engine is less likely to take translations from the baseline and therefore less likely to produce a general translation instead of a more specific technical one. However, as the data for the Vertical comes from different sources within a domain, inconsistencies in style and terminology are also more likely, and these will need to be checked during the post-editing and quality-checking stages.

    SDL Verticals are available for the following domains in a wide range of languages:

    • Automotive Vertical
    • Consumer Electronics (CE) Vertical
    • HiTech (IT Hardware) Vertical
    • Travel Vertical

    These engines are always under development and, whenever there is a considerable amount of new data and/or new technical features that can enhance the overall performance of the engine, they are retrained to improve the quality of the MT output.

    The vertical retraining process is designed to increase productivity when working with vertical output. However, if a client prefers a specific translation for a certain term, and the original vertical produced that translation, retraining might change the term to a more widely used translation. This will need to be corrected during post-editing, and we recommend adding terms like this to your QA check.

    Verticals – Pros and Cons

    5.3  Customisations

    A customisation is a trained statistical engine that only (or mainly) contains client-specific corpora. It involves preparing client-specific TMs in order to get the best MT output for production. The recommended requirement for a successful customisation is an aligned corpus of 1 million words of relevant customer data, although this may vary per project and language pair, and it is possible to create a customisation with lower volumes of customer data.

    Using this type of material guarantees adherence to client-specific terminology and

    style.

    As the machine translation output is fully based on the bilingual corpus, with no syntactical or lexical data added, the quality of the output can only be as good as the quality of the corpus. If the corpus data has inconsistent terminology and/or style, the resulting MT may also be inconsistent. That is why it is important that the linguist responsible for the customisation chooses suitable data for the SMT engine training.

    Customisation – Pros and Cons

    5.4  Engine training process

    When a project is sent to iMT, all the necessary data is collated  –  including project

    TMs, sample files, project information, etc. The next step in the process is to evaluate

    the source text and establish if it is suitable for machine translation. A source evaluation

    will also allow the linguist to identify any possible issues with the use of MT on the

    project, so that action can be taken during engine creation to try to minimise those

    issues. If the data is suitable, then the TMs are prepared for training the engine. SMT

    engine training is an iterative process, and involves the following steps:

    1. TM cleaning
    2. Selection of test sentences
    3. Testing

    TM cleaning 

    Data cleaning is a process applied to the training corpus in order to make it compatible with the platform where the SMT engines are created. This process improves the quality of the data by removing content which could adversely affect the MT output, such as tags, entities, misaligned segments, and corruptions. Such content could otherwise appear in the output and reduce productivity. Some parts are also harmonised so that the MT output is faster to post-edit, as fewer changes will be required.
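    As a rough illustration of what such a cleaning pass might look like, the Python sketch below strips inline tags, decodes character entities and drops segment pairs that look misaligned. It is not the actual cleaning tool: the function names, the decision to decode entities rather than delete them, and the length-ratio threshold of 3.0 are all assumptions made for the example.

```python
import html
import re
from typing import List, Tuple

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_segment(text: str) -> str:
    """Remove inline tags, decode character entities, normalise whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


def clean_tm(pairs: List[Tuple[str, str]],
             max_length_ratio: float = 3.0) -> List[Tuple[str, str]]:
    """Clean each (source, target) pair and drop suspicious ones."""
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if not source or not target:
            continue  # one side is empty: nothing to learn from
        src_len, tgt_len = len(source.split()), len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue  # very different lengths: probably misaligned
        cleaned.append((source, target))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("<b>Click</b> OK", "Cliquez sur&nbsp;OK"),
        ("Yes", "Oui mais ceci est une phrase bien trop longue"),
    ]
    print(clean_tm(sample))
```

    A length ratio is only one crude signal of misalignment; in practice the threshold would need to be tuned per language pair.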

    Creation of training

    During a customisation, several trainings with different combinations of data may be

    uploaded to the system and then evaluated so the iMT team can select the one that

    delivers the best results. A second trial is based on the results of the first one  –  the

    problems found in the output are traced back to the TM data, which is then manipulated

    further to try to solve the issues. The training with the best results is then deployed for

    production.

    Selection of test sentences 

    For MT testing purposes, the linguist selects a set of sentences which do not appear in the corpus that will be uploaded to the SMT system. Ideally, the sentences should be taken from new untranslated project files, as this is the best way to reproduce a real translation scenario and test the engine thoroughly.
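    The key requirement is that test sentences are not already in the training corpus. A minimal Python sketch of that check follows; the function names, the lower-casing and whitespace normalisation, and the fixed random seed are assumptions for illustration, not a description of the tooling actually used.

```python
import random
from typing import Iterable, List


def normalise(sentence: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def select_test_sentences(new_sentences: Iterable[str],
                          training_sources: Iterable[str],
                          sample_size: int,
                          seed: int = 0) -> List[str]:
    """Pick up to sample_size new sentences that are absent from the corpus."""
    seen = {normalise(s) for s in training_sources}
    unseen = [s for s in new_sentences if normalise(s) not in seen]
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    return rng.sample(unseen, min(sample_size, len(unseen)))


if __name__ == "__main__":
    corpus = ["Click OK.", "Select the option."]
    new_files = ["click  ok.", "Remove the battery.", "Close the cover."]
    print(select_test_sentences(new_files, corpus, sample_size=2))
```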

    Testing 

    One of the biggest challenges within the MT industry at present is to find an automatic measure that can forecast whether a particular MT output will meet a particular user's goal. This is particularly difficult because translation has no unique solutions: many translations may be right for one sentence, and even more can be wrong. Since automatic assessment of MT output quality is generally based on comparing the MT to reference translations, finding an automatic procedure to determine MT output quality is a challenging task, and one on which a lot of work is currently concentrated.

    Nowadays, many MT providers choose between human and automatic evaluations (or

    a combination of both).

    Human evaluation is normally centred on Likert-based scales. With this method, resources are asked to score aspects of the MT output by following a list of parameters associated with a numerical scale. For example, ‘score 5 if the output is entirely correct, score 4 if the output is understandable but has grammatical errors, …’. This kind of assessment mainly focuses on understandability, although some vendors have started looking into Likert-based scales that could help assess the post-editing effort. Human evaluation can also be used to compare two or more MT engines or systems, and is based on the evaluator stating their preference between two or more MT outputs generated for the same source sentences.

    Some of the disadvantages inherent in human evaluation are:

    • Performing this kind of test is relatively expensive and time-consuming, as several resources are required for assessing each and every engine.
    • Human evaluations are prone to subjectivity, and final assessments may not be consistent.
    • Resources need to be familiar with the scales and follow them to the letter in order to obtain valid results.

    However, when done well, a human evaluation is still often considered to be more reliable than automated measures, and has the added advantage that a human translator can provide useful comments on the issues found in the MT output.

    The productivity increase, though, is still difficult to predict in all cases, as productivity may vary per job and also per resource (it varies with post-editing experience, for instance). Most productivity tests in the industry are based on a combination of measuring post-editing speed and post-editing effort, or on comparing post-editing speed with conventional translation speed.

    In recent decades, many measures for automated evaluation have been proposed. Most automated measures assess the quality of the machine translation by comparing it to a reference translation which is deemed to be of high quality. Some of the most widely used ones are detailed below.

    BLEU (Bilingual Evaluation Understudy) score: this algorithm is meant to evaluate the quality of text which has been machine-translated. The central idea behind BLEU is “the closer a machine translation is to a professional human translation, the better it is”. To that end, scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Intelligibility and grammatical correctness are not taken into account explicitly; they are assumed to be reflected in the correct reference translations.
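    The following Python sketch shows the two ingredients usually associated with BLEU: clipped n-gram precision and a brevity penalty for output that is shorter than the reference. It is a simplified, hypothetical version for one segment and one reference, with no smoothing; the function names are our own, and production scoring tools differ in details such as tokenisation and how counts are pooled over a corpus.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(words: List[str], n: int) -> Counter:
    """Count the n-grams of length n in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu_sketch(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference, segment-level BLEU-style score in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: only candidates shorter than the reference are penalised.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu_sketch("the cat sat on the mat", reference))
    print(bleu_sketch("the cat sat on the mat today", reference))
```

    Because there is no smoothing, any segment with no matching 4-gram scores zero here, which is one reason segment-level scores are of limited use on their own.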

    NIST: the name of this metric comes from the US National Institute of Standards and

    Technology. This measure is based on the BLEU score, but it differs from this algorithm

    in several points.

    Whilst BLEU simply calculates how many n-grams match in both the reference translation and the MT output and gives these n-grams the same weight, NIST also calculates how “informative” a particular n-gram is. When a correct n-gram is found, the algorithm checks whether that combination is a common sequence in the corpus material or a rare one, and weights the n-gram accordingly: the rarer the n-gram, the more weight it receives. For example, if the bigram "on the" is correctly matched, it will receive lower weight than a correct match of the bigram "interesting calculations", as the latter is less likely to occur.
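    The weighting idea can be sketched in a few lines of Python. The information weight of an n-gram is commonly given as the base-2 logarithm of the count of its first n-1 words divided by the count of the whole n-gram, both taken from the reference translations; for single words the numerator is the total number of words. The function names and the tiny reference set below are invented for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def count_ngrams(sentences: List[str], n: int) -> Counter:
    """Count all n-grams of length n across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def information_weights(references: List[str], n: int) -> Dict[Tuple[str, ...], float]:
    """Information weight per n-gram: rarer continuations score higher."""
    grams = count_ngrams(references, n)
    if n == 1:
        total_words = sum(grams.values())
        return {g: math.log2(total_words / c) for g, c in grams.items()}
    prefixes = count_ngrams(references, n - 1)
    return {g: math.log2(prefixes[g[:-1]] / c) for g, c in grams.items()}


if __name__ == "__main__":
    refs = ["on the table", "on the chair", "on the floor",
            "interesting calculations", "interesting ideas"]
    weights = information_weights(refs, 2)
    print(weights[("on", "the")], weights[("interesting", "calculations")])
```

    In this toy reference set "on" is always followed by "the", so the bigram carries no information, whereas "interesting" is followed by "calculations" only half of the time and therefore receives a higher weight.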

    NIST also differs from BLEU in how some penalties are calculated. For example, small

    variations in translation length do not impact the overall NIST score as much as in

    BLEU.

    METEOR (Metric for Evaluation of Translation with Explicit ORdering): this metric was

    designed to address some of the problems found in the more popular BLEU metric, and

    also produce good correlation with human judgment at the sentence or segment level

    (this differs from the BLEU metric in that BLEU seeks correlation at the corpus level).

    For that, several features that had not been part of any other metrics at the time were

    introduced. Matches in METEOR are made by following the parameters below, among

    others:

    Exact words: as with other metrics, a match is made if two words are identical in the

    machine translation output and the reference translation.

    Stem: words are reduced to their stem form. If two words have the same stem, a match

    is also made.

    Synonymy: words are matched if they are synonyms of one another. Words are

    considered synonymous if they share any synonym sets according to an external

    database.
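    The three matching stages can be illustrated with a small Python sketch. Everything in it is a stand-in: the suffix-stripping `stem` function is far cruder than a real stemmer, and `SYNONYM_SETS` is a two-entry toy list rather than an external synonym database. It only shows the order in which the checks are applied.

```python
from typing import Optional

# Hypothetical stand-ins for a real stemmer and a real synonym database.
SUFFIXES = ("ing", "ed", "s")
SYNONYM_SETS = [{"car", "automobile"}, {"fast", "quick"}]


def stem(word: str) -> str:
    """Very crude stemmer: strip one common English suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def match_type(mt_word: str, ref_word: str) -> Optional[str]:
    """Return 'exact', 'stem' or 'synonym', or None if the words do not match."""
    a, b = mt_word.lower(), ref_word.lower()
    if a == b:
        return "exact"
    if stem(a) == stem(b):
        return "stem"
    if any(a in group and b in group for group in SYNONYM_SETS):
        return "synonym"
    return None


if __name__ == "__main__":
    for pair in [("sensor", "sensor"), ("connected", "connects"),
                 ("car", "automobile"), ("dog", "cat")]:
        print(pair, match_type(*pair))
```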

    TER (Translation Edit Rate): this metric measures the number of edits required to

    change a machine translation output into one of the human references.

    Levenshtein distance: this metric measures the similarity or the dissimilarity (“distance”) between two text strings by calculating the minimum number of single-character edits (insertion, deletion, substitution) required to change one string into the other. In the field of machine translation, this can be done by comparing the raw MT output to the human translation.

    Let's look at a couple of examples:

    The Levenshtein distance between "sport" and "short" is 1, because 1 edit is required

    to convert one word into the other (replace “p” with “h”).

    The Levenshtein distance between “dog” and “frog” is 2, as it is not possible to convert

    the first word into the second with fewer edits (replace “d” with “f” and add “r”). 

    The distance has an upper bound: it can never exceed the length of the longer of the two input strings. If two words have nothing in common, the minimum number of edits is exactly the number of characters in the longer string.

    Example: if we have “computer” and “alibi”, the Levenshtein distance will be 8 and no higher than 8:

    replace “c” with “a” 

    replace “o” with “l” 

    replace “m” with “i”

    replace “p” with “b”

    replace “u” with “i”

    delete “t” 

    delete “e” 

    delete “r” 
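    The examples above can be checked with a short Python sketch of the standard dynamic-programming calculation. The function name `levenshtein` is our own; the code accepts any two sequences, so it works on strings (character edits) and also on lists of words.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed part of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for first, second in [("sport", "short"), ("dog", "frog"), ("computer", "alibi")]:
        print(first, second, levenshtein(first, second))
```

    Passing two lists of words, for example `levenshtein(mt.split(), reference.split())`, counts word-level edits instead of character-level ones.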

    As with other automated measures, the results of the Levenshtein distance are not set in stone. As mentioned before, there can be many correct translations for a single source, so the Levenshtein distance cannot measure quality on its own. Results will vary, for example, if clauses are positioned differently in the MT output and in the human reference translation.

    Example:

    MT: “If I go home after 10pm, I will let you know”.

    Reference human translation: “I will let you know if I go home after 10 pm”.

    In this case, the MT output is correct and no changes would be necessary during a

    post-editing stage. However, the Levenshtein distance will be quite high, as many

    changes would be required to turn the first sentence into the second one.

    That suggests once more the importance of selecting large test beds to run any of

    these automated evaluations on, as that will allow us to get more reliable results.

     Automatic measures also have their limitations: the reference translation is not always

    available, and those measures do not give an indication of post-editing productivity

    expected. Therefore, they are useful for engine training development and comparison,

    but not necessarily practical for a production scenario.

    In January 2011, TAUS began working with a group of its enterprise members with a clear objective in mind: to tackle the general problem of evaluating translation quality. The idea of the Dynamic Quality Evaluation Framework (DQF) was born as a result.

    The framework is still in development, and will allow users to profile their content and receive guidance on best-fit evaluation techniques. A knowledge base documenting best practices provides detailed practical information on how to carry out seven specific types of quality evaluation. By establishing best practices, metrics and benchmarks within a dynamic framework, the project team sought to apply best-fit evaluation approaches depending on content type and usage, moving away from the dated, static, one-size-fits-all approach used by most vendors.

    6  Using the MT Output: the basics of post-editing

    6.1  Introduction to post-editing

    Post-editing is a new phase that replaces conventional translation for MT projects. It is a change in the process, but the working environment remains the same. The same applications and the same reference materials used in a conventional translation project are also used when post-editing. Machine translation is a new component in the process that gives human translators more leverage, alongside the use of TMs. Post-editors work in CAT tools, editing fuzzy matches from the TM and machine-translated segments to a publishable quality.

    Post-editing is a skill which translators develop with time. Post-editors will not be fully productive from day one, as they need to learn their trade. Industry research has shown that experience is the single most important factor in translation productivity and becomes even more influential in post-editing. Over time, translators can adapt their working practices to use the MT output to their advantage.

    Integrating post-editing into a production environment

    On a file for post-editing, the Translation Memory is applied as usual, to create the

    100% matches and fuzzy matches. Machine translation is applied to any untranslated

    text left after the TM is applied.
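    The order in which the TM and the MT are applied can be sketched in Python as follows. This is a simplified illustration only: the 0.75 fuzzy threshold, the use of `difflib.SequenceMatcher` as a similarity measure and the `machine_translate` callback are all assumptions, and real CAT tools calculate match percentages in their own way.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple

FUZZY_THRESHOLD = 0.75  # hypothetical cut-off for accepting a fuzzy match


def pretranslate(segment: str,
                 tm: Dict[str, str],
                 machine_translate: Callable[[str], str]) -> Tuple[str, str]:
    """Return (origin, translation); origin is '100%', 'fuzzy' or 'MT'."""
    if segment in tm:
        return "100%", tm[segment]
    best_source, best_score = None, 0.0
    for source in tm:
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= FUZZY_THRESHOLD:
        return "fuzzy", tm[best_source]
    # Nothing usable in the TM: fall back on machine translation.
    return "MT", machine_translate(segment)


if __name__ == "__main__":
    memory = {"Click OK.": "Cliquez sur OK."}
    fake_mt = lambda text: "[MT] " + text  # placeholder for a real engine
    for seg in ["Click OK.", "Click OK", "Remove the battery."]:
        print(seg, "->", pretranslate(seg, memory, fake_mt))
```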

    The post-editing phase itself involves a number of key stages. Since the post-editor is

    attempting to be as efficient and productive as possible, preparation is key. Do not rush

    ahead without taking time to consider the source and MT output. Determine the

    useable parts and then build around these. Focus on accuracy, without under- or over-

    editing, and finally check over the grammar and the terminology. Post-editors are

    generally advised that if the text scans well, it will flow well.

    6.2  Degrees of post-editing

    The market makes a distinction between post-editing to publishable quality and post-editing to an understandable level. Post-editing to publishable level is the highest quality standard. This is in line with the expectations of the majority of SDL's clients.

     After post-editing, files undergo a quality check to ensure that the translation is correct

    and fluent. The final quality should be comparable to conventional translation.

    Post-editing to understandable quality, or light post-editing, is normally required for low

    visibility text, or texts that would not otherwise be translated for a client as it would be

    too expensive and time-consuming. A client might decide to opt for understandable

    quality texts in order to reduce the number of support requests for a product or to

    provide an extra service to the user, for example. Typical purposes of understandable

    quality texts include offering users a quick answer on how to fix an issue or providing a

    translation solution for low visibility content, such as FAQs, blogs, and knowledge

    bases.

    When post-editing to an understandable level alone, it is less important to correct style

    and grammar so long as the meaning of a translation is clear. Most important, however,

    is to follow the clear project requirements that should always be provided by the client

    in advance.

    Examples of light post-editing

    IT-EN

    IT Source: Attrezzo di compressione per misurare la sporgenza delle canne dei cilindri (da utilizzare con 380000364 e piastre specifiche)

    EN MT output: Tools for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)

    EN Post-edited: Tool for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)

    Comments: The plural needs to be edited because "attrezzo" is singular in the Italian source, but there is no need to remove the space after the bracket.

    IT-EN

    IT Source: Prima di iniziare qualsiasi lavoro in quest'area, spegnere il motore ed estrarre la chiave di accensione.

    EN MT output: Always stop the engine and remove the Key before working in this area.

    EN Post-edited: Always stop the engine and remove the Key before working in this area.

    Comments: There is no need to change the uppercase to lower case.

    FR-EN

    FR Source: Si la valeur souhaitée n’est pas obtenue, répéter les instructions 3 à 5.

    EN MT output: If the desired pressure has not been reached, repeat instructions 3 to 5.

    EN Post-edited: If the desired pressure has not been reached, repeat instructions 3 to 5.

    Comments: "Required" would be better than "desired", but since this is perfectly understandable there is no need to change it.

    EN-DE

    EN Source: To remove the 3D diffuser:

    DE MT output: Zum Entfernen des 3D Refraktionstechnik:

    DE Post-edited: Zum Entfernen des 3D Refraktionstechnik:

    Comments: The MT has the wrong article, “des” instead of “der”, but the MT sentence is perfectly understandable as it is.

    EN-FR

    EN Source: The pressure is reduced to pilot pressure.

    FR MT output: La pression est réduit à la pression pilote.

    FR Post-edited: La pression est réduit à la pression pilote.

    Comments: The gender agreement is wrong (it should be “réduite” instead of “réduit”), but the sentence is understandable as it is and does not need to be corrected.

    Publishable quality vs. Understandable level

    Post-editing to publishable quality is covered in more detail in the next chapters. When
    post-editing to publishable quality, the following rules apply:

    Publishable Quality
    • Most frequent form of post-editing
    • Generally used for higher visibility texts
    • Comparable to conventional translation
    • High quality expectations
    • Follows standard client expectations

    Understandable Level
    • Less frequent form of post-editing
    • Generally used for lower visibility texts
    • Focus on meaning not on style and grammar
    • Expectations based on specific client requirements
    • Clear requirements are needed


    6.3  The quality check process

    It is recommended that the post-editing process is followed by a quality check, which is

    the equivalent of conventional review.

    1. Read the source segment first and then the MT output.
    2. Determine the usable elements (single words and phrases) and make them the basis for your translation.
    3. Build from the MT output and use every part of the MT output that can speed up your work.
    4. Take care not to over-edit (unnecessary rephrasing) or under-edit (wrong prepositions, inflections, compounds, etc.) the MT output. The adjustment of style (such as “may” versus “might”) can be optional, but grammatical correctness in the target is not.
    5. Correct any grammatical errors and make sure that the terminology of the MT output is compliant with glossaries and termbases. This will always need to be checked as any inconsistencies in the training material will be reproduced in the output.
    6. Run the compulsory checks (spelling, grammar, terminology check).
    7. Finally, after post-editing each segment, reread your translation and make sure that no details are missing and you have not left any words that are not needed.


    As part of SDL's workflow, the quality check is performed as a separate step by a
    reviewer and guarantees that the translation is fully publishable. To achieve this, quality
    at source is key: the post-edited file should already be of publishable quality. To
    facilitate this, ensure that the post-editor receives clear instructions and has access to
    all the most up-to-date reference materials. The required QA checks need to be run and
    can be used as an indication of the post-editing quality.

    When quality-checking, always bear the MT in mind and understand the initial MT

    output. Identify known problems in advance (see section 8) and make sure to include

    them in your checks (e.g. wrong prepositions, terminology, known issues with MT). It is

    important to learn to distinguish between what needs to be changed and what can

    remain untouched. Note that there are some items which always need to be amended

    by the post-editor. Examples include date formats, spacing, wrong prepositions or

    terminology issues caused by several possible translations of the same word.

    When quality-checking machine-translated material, focus on over-editing and
    under-editing (depending on style and client requirements). Over-editing will lead to lower
    productivity and needs to be avoided during both the PE and the QA check phase.
    Under-editing may result in quality issues and will increase the time needed for the
    quality check.

    Before starting a quality check, make sure that all the content has been translated.
    Then check that the post-edited text reads well from a user's point of view. The
    post-edited text must match the source. Be careful to look for mistranslations, words left
    out of the translation or additional words which are not in the source text. Check that
    there are no typos. Scrolling down the file will enable you to spot spelling mistakes and
    inconsistencies. Terminology should be consistent with the master glossary, especially
    product names. It is vital that terminology is consistent. Sometimes terminology is not
    consistent in the TMs and there are additional lists and guidelines for terminology.
    Finally, check that the overall style is consistent with the rest of the files and complies
    with the style guide from the client.
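
The glossary part of this check can be illustrated with a minimal Python sketch. It is not SDL's terminology check or any CAT tool feature: the function name `missing_terms`, the dictionary format of the glossary and the single glossary entry are all assumptions made for illustration, and the sample segment is taken from the examples in this guide. The sketch uses plain substring matching, so it would miss inflected or hyphenated forms.

```python
def missing_terms(source: str, target: str, glossary: dict[str, str]) -> list[str]:
    """List approved target terms that are absent although their source term occurs.

    Hypothetical helper: plain, case-insensitive substring matching only.
    """
    source_lower = source.lower()
    target_lower = target.lower()
    return [
        approved
        for term, approved in glossary.items()
        if term.lower() in source_lower and approved.lower() not in target_lower
    ]


# Assumed glossary entry (source term -> approved target term), for illustration only
glossary = {"default printer": "Standarddrucker"}

source = "Install the Bluetooth printer on your computer and set it as the default printer."
target = (
    "Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, "
    "und richten Sie ihn als Standarddrucker ein."
)

# An empty list means every glossary term found in the source has its approved target term
print(missing_terms(source, target, glossary))
```

A check of this kind only flags candidates for review. The reviewer still decides whether a flagged segment really deviates from the master glossary.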


    7  How to get the most out of MT

    7.1  What makes an effective post-editor?

    In order to post-edit effectively, it is essential to use the machine translation output as

    much as possible. Do not ignore the machine translation output and do not translate

    segments from scratch. In almost all cases some parts of the automatic translation

    output can be used and help to speed up work.

    The following guidelines will help you to identify usable parts and achieve the maximum

    post-editing productivity. The translator needs to achieve publishable quality at the

    post-editing stage without sacrificing translation speed. Once you have learnt to identify

    usable parts and to use them, you will find post-editing easier and faster than

    translating from scratch. As with any new skill, however, there is a learning curve
    with MT post-editing: the more you practise, the faster and easier it gets.

    Post-editing tips

    • Do not ignore or erase the MT output
    • Maximise the usage of the MT output
    • Use the appropriate style and terminology
    • Follow the project/client style guidelines
    • If the MT meets the project requirements, do not modify it
    • Do not spend time researching terminology unless the MT is clearly wrong
    • Do not replace words with synonyms
    • Do not make alterations for the sake of variation alone
    • If formatting is an issue, restore the original source format and paste the useful MT parts instead
    • An alternative if there are many tags is to delete them, edit the text, then insert the tags again
    • At the end, re-read the segment and compare it to the source for accuracy

    However, the MT is not only useful when it is easy to edit. You can also use the MT as
    a source of inspiration when looking for the correct translation and pick out bits of the
    sentence to reuse rather than trying to keep as much of the sentence as possible. This
    is particularly relevant for longer sentences. Even sentences that are largely incorrect
    can be useful so long as deleting the incorrect material is not time-consuming.

    Account knowledge also matters for post-editing. As with all translation projects,
    conventional as well as MT, a solid knowledge of the project requirements with regard
    to style guidelines, terminology, TM and client expectations will help you achieve good
    post-editing productivity.

    So what makes a good post-editor?

    • Excellent linguistic skills
    • Domain and subject knowledge
    • Proficiency with CAT tools and automated text-checking
    • Positive attitude towards MT
    • Practice!

    7.2  Post-editing quality expectations

    The quality expectations will vary according to the degree of post-editing and the client
    requirements. However, certain general principles apply. The aim is to deliver a high
    quality translation faster than a conventional translation. Translation speed is a key
    factor when post-editing. Therefore, the machine translation needs to be corrected with
    a view to maintaining efficiency.

    There should be no difference in quality between a human translation and a post-edited

    translation when post-editing to publishable quality. However, there may be a slight

    shift in style. Style should be correct and appropriate to the project, but may need to be

    less refined in order to allow for a more efficient use of the MT output. Where a client

    specifically asks for MT to be used on their project, the client needs to be made aware

    of this and expectations need to be managed accordingly.

    There will of course be a certain amount of variation, but this is a feature of
    conventional translation as well. So long as the quality criteria are adhered to, a
    post-edited text will be considered to have met the quality expectations.


    Post-editing quality criteria

    • The translation must be a correct reflection of the source.
    • Spelling and punctuation must be correct.
    • The translation must be grammatically and syntactically correct and reflect the conventions of the target language.
    • The correct terminology must be applied and used consistently (including preferred translations for frequently occurring terms).
    • Cultural references (date and time formats, units of measurement, number formats, currency, etc.) must be correctly adapted.
    • The style and register of the target must be appropriate for the document type.
    • The original formatting must be reproduced.
    • Project guidelines must be followed.
    • The translation must read well and be suitable for the end user.

    There are two main issues that post-editors often face when attempting to fulfil the
    highest possible quality criteria in the shortest amount of time. These are under-editing
    and over-editing and will be discussed in more detail in the following sections.

    7.3  Under-editing

    If a post-editor has under-edited the MT output, they may have missed important errors
    that needed to be corrected, which may reflect badly on the quality of the translation.
    Under-editing is generally characterised by the following features:


    Under-editing
    • Errors (spelling, typos)
    • Mistranslations (target does not match source)
    • Inconsistent terminology
    • Inaccuracy
    • Inconsistency in figures, units of measurement, etc.
    • Incorrect formatting
    • Not following project-specific instructions

    Below are some examples of under-editing:

    LP: EN-ES
    Source: On its walls you'll discover the figures of a puma and a snake.
    MT: En sus murallas, descubrirá la cifras de un puma y una serpiente.
    PE: En sus murallas descubrirá la figuras de un puma y una serpiente.
    Reviewer: En sus murallas descubrirá las figuras de un puma y una serpiente.
    Comment: The term “cifras” has been correctly post-edited and replaced with “figuras”, but the article “la” has not been changed to the plural form.

    LP: EN-ES
    Source: Inside you can see a sacrificial altar made of a huge stone.
    MT: En su interior se puede ver una altar de sacrificios de una enorme piedra.
    PE: En su interior se puede ver una altar de sacrificios hecho con una enorme piedra.
    Reviewer: En su interior se puede ver un altar de sacrificios hecho con una enorme piedra.
    Comment: The preposition “de” has been correctly post-edited, but the article “una” does not correspond to the gender of the noun “altar” (“una” is feminine whilst “altar” is masculine).

    LP: EN-FR
    Source: How long will the battery last using interactive features (such as games) on my phone?
    MT: Combien de temps dure l'autonomie à partir d'interactive fonctions (comme les jeux) sur mon téléphone ?
    PE: Combien de temps dure l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) sur mon telephone?
    Reviewer: Quelle est l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) de mon telephone ?
    Comment: "Combien de temps dure" should not be combined with the word "autonomie". The literal translation of "How long does XXX last" is not appropriate in this context. The correct version is "Quelle est l'autonomie". The preposition "sur" is not appropriate in this context.

    7.4  Over-editing

    If a post-editor has over-edited the MT output, they may be taking extra time which may
    affect their overall productivity and reduce the benefits of post-editing. Over-editing is
    typically characterised by preferential rather than necessary changes.


    Over-editing
    • Do not rewrite the translation unless unavoidable
    • Do not change correct and understandable translations, even if they could be phrased more naturally or fluently
    • If the MT output style meets the project requirements, do not change it
    • Reduce changes to a minimum and focus on actual mistakes

    There is always room to allow stylistic changes and creativity with post-editing, and
    certainly stylistic features that do not meet with the client style guides should be
    amended. The important thing to remember is not to let preferential changes distract
    from necessary amendments and not to let these changes have a negative impact on
    the overall productivity.

    Below are some examples of over-editing:

    Language pair: DE-EN
    Source: Die Kühlung erfolgt durch das massive Aluminium-Gehäuse und die seitlich angebrachten Kühlrippen und kommt gänzlich ohne Lüfter aus.
    MT: The cooling takes place through the solid aluminum case and the side-mounted cooling fins and comes completely without fans.
    PE with over-editing: The cooling fins fitted on the side of the solid aluminium casing ensure that the computer is cooled, as it comes completely without fans.
    PE without over-editing: Cooling takes place through the solid aluminium casing and the side-mounted cooling fins - there is no need whatsoever for fans.
    Comment on over-edited version: Unnecessary re-ordering and re-translating of segments

    Language pair: DE-EN
    Source: Aber nicht nur Äußerlich hat dieses Festplattengehäuse einiges zu bieten.
    MT: But not only on the outside, this hard drive enclosure has something to offer.
    PE with over-editing: This hard drive casing has more than just a great design.
    PE without over-editing: But it's not only on the outside where this hard drive casing has something to offer.
    Comment on over-edited version: Overedited version is stylistically more pleasing, but requires a major rewrite, while version without overediting is equally correct.

    Language pair: DE-EN
    Source: Fotos mit 1,3 Megapixeln
    MT: Photos with 1.3 megapixels
    PE with over-editing: 1.3 megapixel photos
    PE without over-editing: Photos with 1.3 megapixels
    Comment on over-edited version: Unnecessary re-ordering of segments

    Language pair: DE-EN
    Source: Zudem stehen verschieden SATA-Typen zur Auswahl, wie z.B. Micro SATA oder Slimline-SATA.
    MT: In addition there are different SATA-types are available, such as micro SATA or Slimline SATA.
    PE with over-editing: There are various types of SATA available for this, such as micro SATA or slimline SATA.
    PE without over-editing: In addition, there are different SATA types available, such as micro SATA or slimline SATA.
    Comment on over-edited version: Unnecessary re-phrasing and change of syntax.

    Language pair: DE-EN
    Source: Mit der 1 Meter langen Tischantenne können Sie Ihren WLAN-Empfang deutlich optimieren.
    MT: With the 1 meter long Tischantenne you can significantly optimize your WLAN-reception.
    PE with over-editing: You can optimise your WLAN reception significantly using the 1-m table-top antenna.
    PE without over-editing: With the 1-m table-top antenna you can significantly optimise your WLAN reception.
    Comment on over-edited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if syntax is kept as is

    Language pair: EN-DE
    Source: Make sure that the brake pedal is depressed while you perform this procedure.
    MT: Sicherstellen, dass das Bremspedal niedergedrückt wird während Sie dieses Verfahren durchführen.
    PE with over-editing: Währenddessen muss das Bremspedal weiterhin gedrückt werden!
    PE without over-editing: Das Bremspedal muss niedergedrückt sein, während Sie dieses Verfahren durchführen.
    Comment on over-edited version: Unnecessary re-write; usable parts of the MT were ignored in overedited version

    Language pair: EN-DE
    Source: Install the Bluetooth printer on your computer and set it as the default printer.
    MT: Installieren Sie die Bluetooth Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker.
    PE with over-editing: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und legen Sie ihn als Standarddrucker fest.
    PE without over-editing: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker ein.
    Comment on over-edited version: Unnecessary use of synonyms; verb "einrichten" was unnecessarily replaced by "festlegen"

    Language pair: EN-DE
    Source: Allow the computer to lock automatically after 10 seconds.
    MT: Warten Sie, bis der Computer die Sperre automatisch nach 10 Sekunden.
    PE with over-editing: Gestatten Sie, dass der Computer nach 10 Sekunden automatisch gesperrt wird.
    PE without over-editing: Warten Sie, bis der Computer die Sperre nach 10 Sekunden automatisch aktiviert.
    Comment on over-edited version: Unnecessary use of synonyms; verb "warten" was unnecessarily replaced by "gestatten"; "warten" conveyed the same meaning in this context

    Language pair: EN-DE
    Source: When the proximity feature is enabled but inactive, the following message displays in the Bluetooth Device Control window for the phone:
    MT: Wenn der Nähe Funktion aktiviert, aber nicht aktiv ist, wird die folgende Meldung in der Bluetooth Device Control Fenster für das Telefon:
    PE with over-editing: Wenn die Näherungsfunktion eingeschaltet aber inaktiv ist, wird im Fenster "Bluetooth-Gerätesteuerung" für das Telefon die folgende Meldung angezeigt:
    PE without over-editing: Wenn die Näherungsfunktion aktiviert aber nicht aktiv ist, wird die folgende Meldung im Fenster "Bluetooth-Gerätesteuerung" für das Telefon angezeigt:
    Comment on over-edited version: Unnecessary use of synonyms; "eingeschaltet" is synonym to "aktiviert" and "inaktiv" is synonym to "nicht aktiv" in this context

    Language pair: EN-DE
    Source: This feature provides a quick way to transfer files without requiring you to browse the file system on the other device.
    MT: Diese Funktion bietet eine schnelle Möglichkeit, Dateien, ohne die Datei zu durchsuchen auf der anderen Gerät zu übertragen.
    PE with over-editing: Mithilfe dieser Funktion lassen sich Dateien schnell übertragen, ohne das Dateisystem des anderen Geräts durchsuchen zu müssen.
    PE without over-editing: Diese Funktion bietet eine Möglichkeit, Dateien schnell ohne Durchsuchen des Dateisystems des anderen Geräts zu übertragen.
    Comment on over-edited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if the syntax is kept as is

    Language pair: EN-FR
    Source: After disconnecting the high voltage terminals, busbars, etc., insulate the parts with insulating tape.
    MT: Après avoir débranché les bornes haute tension, jeux, etc., isoler les pièces avec de la bande adhésive isolante.
    PE with over-editing: Après le débranchement des bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
    PE without over-editing: Après avoir débranché les bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
    Comment on over-edited version: Unnecessary change of syntax

    Language pair: EN-FR
    Source: For further information on the Table View, see the tutorial "Table View Productivity Features"
    MT: Pour plus d'informations sur l'affichage en tableau, voir les sections du tutoriel "Fonctions de productivité - Affichage en tableau"
    PE with over-editing: Pour obtenir de plus amples renseignements sur l’affichage en tableau, voir le tutoriel « Fonctions de productivité - Affichage en tableau »
    PE without over-editing: Pour plus d'informations sur l’affichage en tableau, voir le tutoriel « Fonctions de productivité - Affichage en tableau »
    Comment on over-edited version: Correct expression in MT; not needing any editing

    Language pair: EN-FR
    Source: Alternator is found to be noisy
    MT: L'alternateur est bruyant
    PE with over-editing: Le client trouve que l’alternateur est bruyant
    PE without over-editing: L'alternateur est bruyant
    Comment on over-edited version: Correct expression in MT; not needing any editing

    Language pair: EN-FR
    Source: The oil in these passages is trapped and the blade does not move.
    MT: L'huile dans ces passages est piégée et la lame ne bouge pas.
    PE with over-editing: La lame ne bouge pas car l'huile de ces conduits est piégée.
    PE without over-editing: L'huile dans ces passages est piégée et la lame ne bouge pas.
    Comment on over-edited version: Unnecessary rephrasing

    Language pair: EN-IT
    Source: Be sure that the hydraulic hose is free of abrasion.
    MT: Accertarsi che il flessibile idraulico sia privo di abrasioni.
    PE with over-editing: Assicurarsi che il flessibile idraulico sia privo di abrasioni.
    PE without over-editing: Accertarsi che il flessibile idraulico sia privo di abrasioni.
    Comment on over-edited version: Unnecessary use of a synonym.

    Language pair: EN-IT
    Source: Adjust the angle by raising the rear of the vehicle to ensure water covers the joints.
    MT: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copre i giunti.
    PE with over-editing: Sollevando la parte posteriore del veicolo, regolare l'angolo per assicurarsi che l'acqua copra i giunti.
    PE without over-editing: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copra i giunti.
    Comment on over-edited version: Unnecessary re-ordering of phrases.

    Language pair: EN-IT
    Source: The only way to allow the device to validate a self-signed certificate is to install the certificate on the device.
    MT: L'unico modo per consentire il dispositivo per convalidare un certificato autofirmato per installare il certificato sul dispositivo.
    PE with over-editing: Per permettere al dispositivo di convalidare un certificato autofirmato, l'unico modo è quello di installare il certificato sul dispositivo.
    PE without over-editing: L'unico modo per consentire al dispositivo di convalidare un certificato autofirmato è quello di installare il certificato sul dispositivo.
    Comment on over-edited version: Unnecessary use of synonyms and reordering of phrases.
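
One rough way to see the effect of over-editing is to measure how much of the MT output survives in the post-edited text. The Python sketch below is an illustration only, not a metric defined in this guide or used by SDL: the function name `mt_retention` and the word-level comparison are assumptions. It uses `difflib.SequenceMatcher` from the standard library, and the sample sentences come from the EN-FR example above.

```python
from difflib import SequenceMatcher


def mt_retention(mt: str, pe: str) -> float:
    """Return the share of MT words that survive, in order, in the post-edited text.

    Hypothetical helper: whitespace tokenisation, case- and punctuation-sensitive.
    """
    mt_words = mt.split()
    pe_words = pe.split()
    if not mt_words:
        return 0.0
    matcher = SequenceMatcher(a=mt_words, b=pe_words, autojunk=False)
    # Matching blocks are runs of identical words found in both sequences
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(mt_words)


mt = "L'huile dans ces passages est piégée et la lame ne bouge pas."
pe_overedited = "La lame ne bouge pas car l'huile de ces conduits est piégée."
pe_minimal = "L'huile dans ces passages est piégée et la lame ne bouge pas."

print(f"PE with over-editing:    {mt_retention(mt, pe_overedited):.2f}")
print(f"PE without over-editing: {mt_retention(mt, pe_minimal):.2f}")
```

A low score is not proof of over-editing, since poor MT output legitimately needs heavy correction. It only becomes meaningful when the MT was already correct and understandable, as in the rows marked "Correct expression in MT".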

    7.5  Help improve MT for the future

    To make post-editing easier in the future, post-edit and translate in an MT-friendly way:
    use simple sentence structure, and avoid adding information, rephrasing the source or
    complicating the word order in the target unnecessarily. This will improve the training
    material with which engines are retrained.

    For some language combinations, the word order is considerably different between

    source and target and this will always pose problems for MT. However, keeping closer

    to the source is generally the best way forward:


    In this instance, the second translation has the advantage that the word order in the

    target is closer to the word order in the source. This can help the MT engine to match

    up the words “error” (German: “Fehler”) and “dash” (German: “Armaturenbrett”) more

    easily with their correct translations.

    If the verb is usually found at the beginning of the sentence in the source and at the

    end of the sentence in the target, adding a lot of additional information in the middle

    can also make it harder for the MT to match up source and target segments correctly.

     As a rule, the MT engine can handle shorter phrases better than long convoluted

    sentences.

     A more MT-friendly style is also achieved by keeping trans