Dr. John Tinsley
CEO, IPTranslator
PIUG Annual Conference 2013
Alexandria, VA, April 29th
Understanding Machine Translation and the Challenge of Patents
PIUG Annual Conference, Alexandria, April 29, 2013
The need for translation
Accelerating global growth in the volume of patents:
- 10.7% increase in PCT applications in 2011
- China +33.4%
- Japan +21%
Why listen to me?
Machine translation is what I do!
- BSc in Computational Linguistics
- PhD in Machine Translation (DCU, CMU)
- Software Engineer for MT (CNGL)
- Founder of IPTranslator
Machine Translation: The Basics
Machine Translation = automatic translation
- The use of computers to translate from one language into another
- The use of computers to automate some, or all, of the translation process
Statistical Machine Translation (SMT)
- An approach to Machine Translation in which translations for an input are estimated from previously seen translation examples and their associated (inferred) probabilities
- e.g. IPTranslator, Google Translate
Previous approaches:
- Rule-based (or transfer-based): based on linguistic rules, e.g. Systran; Altavista’s Babelfish
- Example-based: based on translation examples and inferred linguistic patterns
SMT is now by far the predominant approach
Bilingual Corpora
A corpus (pl. corpora) is a collection of texts, in electronic format, in a single language
- document(s)
- book(s)
A bilingual corpus is a collection of corresponding texts in multiple languages
- A document and its translation
- A book in multiple languages
- The European Parliament proceedings
Note:
- source language = the original language, or the language we’re translating from
- target language = the language we’re translating into
[Figure: a bilingual corpus]
Aligned Bilingual Corpora
A document-aligned bilingual corpus corresponds at the document level
For translation, we require sentence-aligned bilingual corpora
- The sentence on line 1 of the source-language text corresponds to (i.e. is a translation of) the sentence on line 1 of the target-language text, and so on
- Often referred to as parallel aligned corpora
Sentence-aligned bilingual parallel corpora are essential for statistical machine translation
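The line-by-line correspondence described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular toolkit: the function name `align_by_line` is hypothetical, and the two sentence pairs are taken from examples used elsewhere in this talk.

```python
from typing import List, Tuple


def align_by_line(source_text: str, target_text: str) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text."""
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    # A sentence-aligned corpus must have the same number of lines on both sides.
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target texts have different line counts")
    return list(zip(source_lines, target_lines))


# Toy English and Spanish texts, one sentence per line.
english = "I have a cat\nA group of rival companies seek sanctions against Google"
spanish = "Tengo un gato\nUn grupo de compañías rivales pide sanciones contra Google"

corpus = align_by_line(english, spanish)
for source_sentence, target_sentence in corpus:
    print(source_sentence, "<->", target_sentence)
```

The length check is the only safeguard in this sketch. Real corpora usually need a dedicated sentence-alignment step before the two sides line up this cleanly.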
Learning From Previous Translations
Suppose we already know (from a sentence-aligned bilingual corpus) that:
- “dog” is translated as “perro”
- “I have a cat” is translated as “Tengo un gato”
We can theoretically translate:
- “I have a dog” -> “Tengo un perro”
- Even though we have never seen “I have a dog” before
Statistical machine translation induces information about unseen input, based on previously known translations
- Primarily co-occurrence statistics
- Takes contextual information into account
Statistical Machine Translation
- Example of a small sentence-aligned bilingual corpus for English-French
Statistical Machine Translation
- We take some new input to translate
- From the corpus we can infer possible target (French) translations for various source (English) words
- We can then select the most probable translations based on simple frequencies (co-occurrence statistics)
Statistical Machine Translation
Given a previously unseen input sentence and our collated statistics, we can estimate a translation
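The idea of inferring word translations from co-occurrence statistics can be sketched as follows. Several things here are assumptions rather than content from the slides: the four-sentence English-French corpus is a hypothetical toy, and the Dice coefficient is swapped in as the association score because raw co-occurrence counts alone tend to favour very frequent words such as articles. Real SMT systems estimate far richer models.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus; the corpus shown on the original slide is not reproduced here.
corpus = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
    ("a blue flower", "une fleur bleue"),
]

src_count = Counter()   # number of sentences containing each source word
tgt_count = Counter()   # number of sentences containing each target word
pair_count = Counter()  # number of sentence pairs in which both words occur

for src_sentence, tgt_sentence in corpus:
    src_words = set(src_sentence.split())
    tgt_words = set(tgt_sentence.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    pair_count.update(product(src_words, tgt_words))


def dice(src_word: str, tgt_word: str) -> float:
    """Dice coefficient: how strongly two words co-occur, relative to their frequencies."""
    return 2 * pair_count[(src_word, tgt_word)] / (src_count[src_word] + tgt_count[tgt_word])


def best_translation(src_word: str) -> str:
    """Pick the target word with the highest association score."""
    if src_word not in src_count:
        return src_word  # unknown words are passed through unchanged
    return max(tgt_count, key=lambda tgt_word: dice(src_word, tgt_word))


def translate(sentence: str) -> str:
    """Word-for-word translation; no reordering and no context."""
    return " ".join(best_translation(word) for word in sentence.split())


print(translate("a house"))         # prints "une maison"
print(translate("the blue house"))  # prints "la bleue maison"
```

The sentence "a house" never appears in the toy corpus, yet the statistics still produce "une maison". The second call shows the limit of this approach: every word is right, but the order is wrong (French needs "la maison bleue"), so word-level statistics alone are not enough.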
Advanced Modelling
All modern approaches are based on building translations for complete sentences by putting together smaller pieces of translation
The previous example is very simplistic
- In reality, SMT systems calculate much more complex statistical models over millions of sentence pairs for a pair of languages
- Upwards of 2M sentence pairs on average for large-scale systems
Statistics are calculated to represent:
- Word-to-word translation probabilities
- Phrase-to-phrase translation probabilities
- Word order probabilities
- Structural information (i.e. syntactic information)
- Fluency of the final output
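One standard way to combine statistics of this kind is the noisy-channel formulation from the SMT literature. The slides do not spell it out, so it is offered here only as an illustration. For a source sentence f, the system looks for the target sentence e with the highest probability:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Here P(f | e) is the translation model, built from the word and phrase translation probabilities, and P(e) is the language model, which rewards fluent output. P(f) can be dropped because it is the same for every candidate e.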
Data is Key
For SMT, data is key
- Information (word/phrase correspondences and associated statistics) is based only on what we have seen before in the data
It is important that the data used to train SMT systems is:
- Of sufficient size: avoids sparseness and skewed statistics
- Representative and relevant: contains the right type of language
- High-quality: absence of misspellings, incorrect alignments, etc.; proofed by human translators
[Figure: training data]
Why is MT Difficult?
A word or a phrase can have more than one meaning (ambiguity, lexical or structural)
- E.g. “bank”, “dive”; “I saw the man with the telescope”
People use language creatively
- New words are cropping up all the time
Linguistic differences between languages
- E.g. structure of Irish sentences vs. structure of English sentences: “Tá (Is) ocras (hunger) orm (on me)” <-> “I am hungry”
There can be more than one way to express the same meaning
- “New York”, “The Big Apple”, “NYC”
Why is MT Difficult?
- Israeli officials are responsible for airport security.
- Israel is in charge of the security at this airport.
- The security work for this airport is the responsibility of the Israel government.
- Israeli side was in charge of the security of this airport.
- Israel is responsible for the airport’s security.
- Israel is responsible for safety work at this airport.
- Israel presides over the security of the airport.
- Israel took charge of the airport security.
- The safety of this airport is taken charge of by Israel.
- This airport’s security is the responsibility of the Israeli security officials.
Not all languages are created equal
It’s easier to translate between some language pairs than others
A group of rival companies seek sanctions against Google
Un grupo de compañías rivales pide sanciones contra Google
We believe that the delegates will make their decision after a long debate
Wir glauben dass die Delegierten ihre Entscheidung nach einer langen Debatte treffen
Thank you very much
Go raibh míle maith agat
(Lit: May you have a thousand good things)
The Challenge of Patents
Long Sentences
Complex constructions
L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
The Challenge of Patents
• Authoring guide for “to be translated” text
• Patents break almost all of the rules!
• “Thanks, guys(!)”
- Very long sentences as standard
- Grammatically incomplete using nominal and telegraphic style (!)
- Passive forms are frequent
- Frequent use of subordinate clauses, participles, implicit constructs
- Inconsistent and incorrect spelling
- High use of neologisms
- Instances of synonymy and polysemy
- Spurious use of punctuation
Evaluating Machine Translation Quality
Automatic Evaluation
Judge the quality of an MT system by comparing its output against a human-produced “reference” translation
- Pros: Quick, cheap, consistent
- Cons: Inflexible, cannot be used on ‘new’ input
Human Evaluation
Assessment of output by a bilingual evaluator
- Pros: Reliable, flexible, multi-faceted (fluency, error analyses, benchmarking)
- Cons: Slow, expensive, subjective
Task Based Evaluation
Fluency vs Adequacy
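Comparing MT output against a human reference can be sketched with clipped n-gram precision, the building block of metrics such as BLEU. This is a simplified illustration, not necessarily the metric used by the author. The function names are hypothetical, and the reference translation "The big red house" is an assumed human translation of the talk's Spanish example "La gran casa roja".

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "The big red house"  # assumed human reference translation
output_1 = "The big blue house"
output_2 = "The big house red"

for name, output in [("Output 1", output_1), ("Output 2", output_2)]:
    unigram = clipped_precision(output, reference, n=1)
    bigram = clipped_precision(output, reference, n=2)
    print(f"{name}: unigram={unigram:.2f} bigram={bigram:.2f}")
```

On this pair, Output 2 scores 1.00 on unigrams and Output 1 scores 0.75, while both score 0.33 on bigrams. The numbers are quick and repeatable, but they only measure word overlap with the one reference supplied, which is the inflexibility listed under "Cons" above.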
Evaluating Machine Translation Quality
Task Based Evaluation
- Standalone evaluation of MT systems is necessary to get a sense of the overall quality of a system
- To determine the ultimate usability of an MT system, intrinsic task-based evaluation is required
- Why? Fluency vs. Adequacy
Fluency: how fluent and grammatically correct the translation output is
Adequacy: how accurately the translation conveys the meaning of the source
Source: La gran casa roja
Output 1: The big blue house
Output 2: The big house red
Practical uses of Machine Translation
Understand its limitations and you’ll understand its capabilities!
No
• Translate a patent for filing
• Translate literature for publication
• Translate marketing materials
Yes
• Productivity tool for professional translation
• Understand foreign patents
• Localisation processes and “controlled” content
Thank you!
Dr. John Tinsley
[email protected]
@IPTranslator
German Verb Movement
We like that Götze scored a goal in the final.
Uns gefällt, dass Götze ein Tor im Finale geschossen hat
(we like that Götze a goal in the final scored has)
Sentence: 这是一篇有趣的文章
Words: 这是 一篇 有趣 的 文章
(zhèshì yīpiān yǒuqù de wénzhāng)
(This is an interesting article)
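Because written Chinese has no spaces between words, the sentence above has to be split into words before word-level statistics can be collected. One simple, well-known technique is forward maximum matching against a word list, sketched below. The function name `segment` and the five-entry lexicon are hypothetical, and this is not necessarily how any production system segments text.

```python
from typing import List, Set


def segment(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest lexicon entry.

    Falls back to a single character when no lexicon entry matches.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


# Hypothetical toy lexicon covering only the example sentence.
lexicon = {"这是", "一篇", "有趣", "的", "文章"}

print(" ".join(segment("这是一篇有趣的文章", lexicon)))  # prints "这是 一篇 有趣 的 文章"
```

The greedy longest-match rule is easy to follow but can pick the wrong split when dictionary entries overlap, which is one reason segmentation adds difficulty for Chinese MT.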
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
English: “Software”
Simplified: 软件
Traditional: 軟體
English: “Network”
Simplified: 网络
Traditional: 網路
Я пошёл в магазин
I went to the shop
В магазин пошёл я
I went to the shop
Пошёл я в магазин
I went to the shop