Translating Subtitles using Machine Translation
Practices, Problems, Methodology

Elsa Sklavounou, Ph. D.
Linguist, Co-funded Projects Technical Coordinator
SYSTRAN
www.systransoft.com



SYSTRAN MT Customization Methodology Overview

A customization project involves three customization levels, each of which provides incrementally higher translation quality:

Basic Terminology

Complex Terminology

Linguistic Rules  


SYSTRAN MT Customization Methodology Overview

Basic Terminology: The first step entails the creation of a User Dictionary that covers most of the noun terminology in the corpus, as well as various simple adjective and verb terms.

Complex Terminology: The second level concerns the coding of complex terminological entries, such as complex verbs with their complements (subject, object…) and their translations.

Linguistic Rules: The third level involves language-specific code modifications in the SYSTRAN linguistic modules.

  


SYSTRAN MT Customization Methodology Level 1 & Level 2

Customization levels 1 and 2 focus on implementing specialized terminology from the corpus in the systems. Level 1 and 2 tasks include:

Extraction of simple and complex terms

Translation of simple and complex terms

Coding of simple and complex terms

Review of simple and complex terms


SYSTRAN MT Customization Methodology Level 1 & Level 2

Step 1: Corpus installation and analysis

Prerequisite 1: a formatted corpus

Step 2: Term extraction

Simple terms (nouns and noun expressions)

Complex terms (verb patterns)

DNT (Do Not Translate) integration


SYSTRAN MT Customization Methodology Level 3

Customization level 3 focuses on implementing linguistic rules adapted to the language-specific syntactic and semantic issues found in translations of the corpus. Level 3 tasks include:

Detailed linguistic evaluations and the development of a comprehensive customization plan

Implementation of customized rules

Regression tests

Correction of linguistic translation errors

Acceptance testing before release


SYSTRAN MT Customization Methodology Quality Levels

Estimate of the quality levels that may be achieved for each customization level.


SYSTRAN MT Customization Methodology Software Tools

 

The coding of simple and complex terms and the related dictionary maintenance are managed by the SYSTRAN Linguistics Platform, which integrates two tools required to complete customization levels 1 and 2: the SYSTRAN Dictionary Manager and the SYSTRAN Review Manager.

 

 


SYSTRAN MT Customization Methodology Software Tools

  

SYSTRAN Dictionary Manager

The SYSTRAN Dictionary Manager (SDM) enables translators to build and manage multilingual dictionaries. SDM includes preparation steps for dictionary coding tasks, an online dictionary lookup (via an HTML interface), and a compiler for runtime machine translation dictionaries. It has three main components: a database, an HTML query form (dictionary lookup, reports, logs, import and export), and a Windows client (interactive coding tool).

 


SYSTRAN Customization Methodology Software Tools

The SYSTRAN Review Manager (SRM) is a productivity tool for the review, quality assessment, and maintenance of linguistic resources used in combination with a SYSTRAN system.


SYSTRAN Customization Methodology Prerequisite 1: a formatted grammatical corpus

Grammar Writing Rules

Using Articles

Avoiding Speech Ambiguity

Using Enumeration

Ensuring Subject-Verb Agreement

Using Prepositions

Using Infinitives at the Beginning of Sentences

Using Imperatives

Observing Punctuation Rules

Using Main Clauses

Using Subordinate Clauses

Using Relative Clauses

Avoiding Multiple Stacking

Using Compound Words

Using Capitalization

Using Spelling Variations

Lexical Ambiguities

Disambiguation of Product Names and Menus

Avoiding Lexical Ambiguities

Using Compounds

Format and Typographical Issues

Segmentation


SYSTRAN Customization Methodology for MUSA

The corpus is generated fully automatically by two processes: Speech Recognition (KU Leuven) and Automatic Sentence Compression (CNTS)

First priority: subtitle constraints

Second priority: the least ambiguous content possible

Lesson learned: no prerequisite


SYSTRAN MT Customization Methodology

Upgraded Software Tools (Client Tools v5)


SYSTRAN Translation Project Manager Terminology Review

Not Found Words Extraction

Reviewing Terminology and Sentences

The Terminology Review tab in the Review window lets you identify expressions such as Not Found Words or Terminology extracted by the software.


SYSTRAN Translation Project Manager Terminology Review

Not Found Words Extraction

Examples

SRC_ID: these parents know measles can be dangerous, but they don't want their child to have MMR, the triple vaccine which protects them from measles, mumps and rubella.

Raw MT: ces parents savent la rougeole peut être dangereuse, mais ils ne veulent pas que leur enfant a MMR, le vaccin triple qui les protège contre la rougeole, les oreillons et la rubéole.
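As a rough illustration of what a Not Found Words extraction does, the sketch below lists the tokens of a source sentence that are missing from a lexicon. It is not SYSTRAN's implementation: the function name `not_found_words`, the tokenizer, and the toy lexicon `KNOWN_WORDS` are assumptions made for this example only.

```python
import re

# Hypothetical toy lexicon standing in for the system dictionaries.
KNOWN_WORDS = {"they", "don't", "want", "their", "child", "to", "have"}


def not_found_words(sentence, known_words):
    """Return the tokens of `sentence` that are absent from `known_words`.

    Tokens are compared in lower case; each missing token is reported once,
    in order of first appearance and with its original spelling.
    """
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)
    seen = set()
    missing = []
    for token in tokens:
        key = token.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            missing.append(token)
    return missing


print(not_found_words("they don't want their child to have MMR", KNOWN_WORDS))
```

With this toy lexicon, only `MMR` is reported, which matches the untranslated acronym left in the raw MT output above.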


SYSTRAN Translation Project Manager Alternative Meanings

Alternative Meanings shows alternative translations based on different meanings of a source word or expression.

The Alternative Meanings tab in the Review window shows alternative meanings for expressions in SYSTRAN or User Dictionaries.


SYSTRAN Translation Project Manager Alternative Meanings

Examples

SRC_ID: they'd rather pay for single vaccines at 60 pounds a shot, even though the government insists MMR is safe.

Raw MT: ils payeraient plutôt les vaccins uniques à 60 livres un coup de feu, quoique le gouvernement exige que MMR est sûr.

Customized MT: ils payeraient plutôt les vaccins uniques à 60 livres une injection, quoique le gouvernement exige que MMR est sûr.


SYSTRAN Dictionary Manager User Dictionaries (UDs)

User Dictionaries (UDs) let you increase the quality of source language analyses, which in turn improves the translation output for all associated target languages. UDs can be used for a number of functions, including:

Automatically translating Not Found Words, i.e. words missing from the SYSTRAN dictionary.

Overriding the target-language meaning of a word or expression in the SYSTRAN dictionaries, a capability that lets you customize translation output to fit specific needs.

Ensuring that an expression is always treated as a unit by SYSTRAN analysis programs.
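The functions listed above can be pictured with a small lookup sketch. It is only an illustration under stated assumptions: `apply_user_dictionary`, `USER_DICT`, and the greedy longest-match strategy are hypothetical and do not describe how the SYSTRAN engine actually consults a User Dictionary. The entries reuse word pairs that appear in the examples of this presentation.

```python
# Hypothetical User Dictionary: source expression -> target expression.
USER_DICT = {
    "triple vaccine": "vaccin triple",  # multiword entry kept as one unit
    "shot": "injection",                # overrides the default meaning
    "autism": "autisme",                # covers a word the engine left as-is
}


def apply_user_dictionary(tokens, user_dict, max_len=4):
    """Tag each span as ("UD", target) or ("MT", source token).

    Longer dictionary entries are tried first, so a multiword expression is
    treated as a single unit instead of being looked up word by word.
    """
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in user_dict:
                out.append(("UD", user_dict[span]))
                i += n
                break
        else:
            # No entry starts here: leave the token to the regular MT engine.
            out.append(("MT", tokens[i]))
            i += 1
    return out


for tag, text in apply_user_dictionary(
        "the triple vaccine at 60 pounds a shot".split(), USER_DICT):
    print(tag, text)
```

Spans tagged `UD` come from the dictionary; spans tagged `MT` would be passed on to the translation engine unchanged.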


SYSTRAN Dictionary Manager User Dictionaries (UDs)

Metrics

Type of Dictionary        ENFR                   ENEL
Do Not Translate Words    3532 entries (enxx)
Proper Nouns              1495 entries (enfr)    1495 entries (enel)
MUSA Terminology          1443 entries (enfr)    5228 entries (enel)


SYSTRAN Dictionary Manager User Dictionaries (UDs)

Examples

SRC_ID: Andrew Wakefield ignited the debate over MMR by announcing the findings of research into a group with autism and bowel disease.

Raw MT: Andrew Wakefield a enflammé la discussion au-dessus de MMR en annonçant les résultats de la recherche dans un groupe avec la maladie d'autism et d'entrailles.

Customized MT: Andrew Wakefield a enflammé la discussion au-dessus de MMR en annonçant les résultats de la recherche dans un groupe avec autisme et maladie d'entrailles.

   


SYSTRAN Translation Project Manager Source Analysis

Interactive Disambiguation

The Source Analysis tab in the Review window shows how the software handled source ambiguities and allows you to override the software selections.


SYSTRAN Translation Project Manager Source Analysis

Interactive Disambiguation

Examples

ID 523: At first we thought it was parts of the building but it was people, literally people falling all around us.

Raw MT: D'abord nous avons pensé que ce faisait partie du bâtiment mais c'était les gens, peuplent littéralement la chute tout autour de nous.

Customized MT: D’abord nous avons pensé que c’etait des fragments du bâtiment, mais c’était des gens, littéralement des gens qui tombaient autour de nous.


SYSTRAN Dictionary Manager Normalization Dictionaries (NDs)

There are two types of Normalization Dictionaries (NDs): source normalization and target normalization. Source normalization normalizes the source document before translation. Target normalization adapts the translation output to user needs in terms of terminology consistency. It can also provide a way to replace expressions chosen by the software’s translation engine with user-defined expressions.


SYSTRAN Dictionary Manager Normalization Dictionaries (NDs)

Examples

SRC_ID: we did n't know she had measles but we do. I mean I ca n't help...

Raw MT: nous avons fait le n't savons qu'il a eu la rougeole mais nous faisons. Je veux dire l'aide de n't d'I ca…

Customized MT via SRC Normalization: nous n'avons pas su qu'il a eu la rougeole mais nous faisons. Je veux dire que je ne peux pas aider
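Source normalization of the kind shown in this example can be approximated with a few rewrite rules applied before translation. The sketch below is an assumption about what such rules might look like, not the contents of an actual Normalization Dictionary: `SOURCE_RULES` and `normalize_source` are hypothetical names, and only split contractions of the form "did n't" are covered.

```python
import re

# Hypothetical source-normalization rules. Irregular contractions come
# first, followed by the generic "<verb> n't" pattern.
SOURCE_RULES = [
    (re.compile(r"\bca n't\b"), "cannot"),
    (re.compile(r"\bwo n't\b"), "will not"),
    (re.compile(r"\b(\w+) n't\b"), r"\1 not"),
]


def normalize_source(text):
    """Rewrite split contractions so the engine sees standard word forms."""
    for pattern, replacement in SOURCE_RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize_source(
    "we did n't know she had measles but we do. I mean I ca n't help..."))
```

The rule order matters: if the generic pattern ran first, "ca n't" would become "ca not" rather than "cannot".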


SYSTRAN Translation Project Manager Sentence Review

for Translation Memory Construction

The Sentence Review tab in the Review window compares sentences in the source and target. You can then check the sentences you want to send to User Dictionaries, where you can work with them further in order to post-edit them and construct Translation Memories.


SYSTRAN Dictionary Manager Translation Memories (TMs)

Translation Memory (TM)

A set of translated and validated sentences that can be integrated into the translation process. Translation Memories (TMs) are databases of aligned pre-translated sentences. Unlike Dictionaries, TM entries can be formatted (for example, italic or bold) and are used by the translation engine to perform matches on full sentences in the source document. TMs are not usually created manually, but are built using SYSTRAN’s Translation Project Export or from TMX files.


SYSTRAN Dictionary Manager Translation Memories (TMs)

Examples

ID 370: Now people kind of started panicking and said we've got to leave no matter what.

Raw MT: Maintenant sorte de personnes de panique commencée et dite nous avons pour laisser n'importe ce que.

Customized MT: Les gens maintenant avaient l’air de paniquer disant qu’ils devaient à tout prix partir.
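A full-sentence Translation Memory match, as in this example, can be sketched as a lookup that takes precedence over machine translation. Everything named here is hypothetical (`normalize_key`, `translate`, `TRANSLATION_MEMORY`, `fake_mt`), and the case- and whitespace-insensitive matching is an assumption of the sketch, not documented SYSTRAN behavior.

```python
def normalize_key(sentence):
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(sentence.lower().split())


SOURCE = ("Now people kind of started panicking and said "
          "we've got to leave no matter what.")

# Hypothetical in-memory TM holding the validated sentence pair shown above.
TRANSLATION_MEMORY = {
    normalize_key(SOURCE): (
        "Les gens maintenant avaient l’air de paniquer disant "
        "qu’ils devaient à tout prix partir."
    ),
}


def translate(sentence, memory, machine_translate):
    """Return (translation, origin): a TM hit wins over machine translation."""
    key = normalize_key(sentence)
    if key in memory:
        return memory[key], "TM"
    return machine_translate(sentence), "MT"


def fake_mt(sentence):
    """Stand-in for the MT engine, used only to make the sketch runnable."""
    return "[raw MT] " + sentence


print(translate(SOURCE, TRANSLATION_MEMORY, fake_mt))
print(translate("We have to leave.", TRANSLATION_MEMORY, fake_mt))
```

Only a whole-sentence match triggers the memory; any other sentence falls through to the engine.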


SYSTRAN Dictionary Manager Translation Memories (TMs)

Translation Memory Import/Export

Existing translation memories in the standard TMX (Translation Memory eXchange) format can be imported and exported via the SYSTRAN Dictionary Manager.
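For readers unfamiliar with TMX, the sketch below reads aligned sentence pairs from a minimal TMX document using Python's standard library. It shows the file format only and is unrelated to the SYSTRAN Dictionary Manager import code; `SAMPLE_TMX` and `read_tmx` are hypothetical, and the sample header values are placeholders.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal hand-written TMX document (placeholder header values).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>measles</seg></tuv>
      <tuv xml:lang="fr"><seg>rougeole</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text, src_lang, tgt_lang):
    """Return (source, target) pairs for translation units with both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext())
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


print(read_tmx(SAMPLE_TMX, "en", "fr"))
```

Translation units that lack either language are skipped, so asking for a pair that is absent from the file returns an empty list.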