Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
“The More, the Better”in Human and Machine Translation
what do we have in common?
Ekaterina Lapshinova-Koltunski
GermersheimJanuary 29th, 2015
January 29th, 2015 “The More, the Better” 1
Overview
1 Aims and Motivation
2 Related Work
3 Methods and DataDataFeature Selection
4 Analysis and Interpretations
January 29th, 2015 “The More, the Better” 2
Aims and Motivation
Aims and Motivation
January 29th, 2015 “The More, the Better” 3
Aims and Motivation
Translation Phenomenon
January 29th, 2015 “The More, the Better” 4
Aims and Motivation
Translation Phenomenon
January 29th, 2015 “The More, the Better” 5
Aims and Motivation
Translation Phenomenon: Dimensions
January 29th, 2015 “The More, the Better” 6
Aims and Motivation
Translation Phenomenon: Dimensions
⇒ “translation varieties” distinguishednot only according to
1 languages (both source and target)2 registers (text types they belong to)
but also:1 translation methods (human vs. machine)
used in translation process,2 experience (knowledge and data) involved
Assumption: experience relate HT and MT andprovide us with new perspectives on their research.
January 29th, 2015 “The More, the Better” 7
Aims and Motivation
Assumption: Similarities
January 29th, 2015 “The More, the Better” 8
Aims and Motivation
Assumption: Similarities
HUMAN TRANSLATION MACHINE TRANSLATION
amount of human translatorexperience
amount of data used in SMT
similar outcomes in both translation varieties⇓ ⇓ ⇓ ⇓ ⇓ ⇓ ⇓ ⇓
professional vs. novice trans-lators
systems based on small vs.large training data
learn in a similar way, collecting experience, data or knowledgeHowever: they both differ in the type of the experience acqui-red, which is reflected in their linguistic features
January 29th, 2015 “The More, the Better” 9
Aims and Motivation
Assumption: Similarities
January 29th, 2015 “The More, the Better” 10
Related Work
Related Work
January 29th, 2015 “The More, the Better” 11
Related Work
Human and Machine Translation
studies addressing both human and machine translations:[White, 1994], [Papineni et al., 2002], [Babych et al., 2004],[Popovic and Burchardt, 2011], [Popovic and Ney, 2011]all focus solely on translation error analysis, using humantranslation as a referencestudies operating with linguistically-motivated categories:[Popovic and Burchardt, 2011], [Popovic and Ney, 2011] or[Fishel et al., 2012]However: none of them provides a comprehensive analysis ofspecific linguistically motivated features of different translationmethods
January 29th, 2015 “The More, the Better” 12
Related Work
Human and Machine Translation
works on differentiation between human and machine translation:(1) [Volansky et al., 2011] and (2) [El-Haj et al., 2014]:
(1) analysis of human and machine translations, and comparablenon-translated textsa range of features based on the theory of translationese, see[Gellerstam, 1986]claim that the features specific for human translations can be usedto identify MTcoinciding and diversifying features
(2) compare translation style and consistency in human and machinetranslations of Camus’ novel “The Stranger” (French-English andFrench-Arabic)measure: readability as a proxy for styleevaluative and not descriptive character
January 29th, 2015 “The More, the Better” 13
Related Work
Experience: SMT
the relation between large vs. small data plays an important roleGains from using large corpora for translation/language models:
[Estrella et al., 2007][Bertoldi and Federico, 2009][Koehn and Haddow, 2012]
January 29th, 2015 “The More, the Better” 14
Related Work
Experience: HT
relation between experienced vs. inexperienced translators:[Jääskeläinen, 1997], [Jakobsen, 2002],[Martinez Melis and Hurtado Albir, 2001],[Englund-Dimitrova, 2005], [Muñoz Martín and Conde, 2007],[Carl and Buch-Kromann, 2010]
[Jääskeläinen, 1997]: translational behaviour of professionals andnon-professionals who perform translation from English into Finnish[Carl and Buch-Kromann, 2010]: translation processes for studentand professional translators, relating properties of the translators’eye movements and keystrokes to the quality of the producedtranslations⇒ process rather than the product(no description of linguistic features involved)[Englund-Dimitrova, 2005]: a combined process and productanalysis⇒ only cohesive explicitness
January 29th, 2015 “The More, the Better” 15
Related Work
Experience Combined
some parallelism in the observations for both translation varieties[Göpferich & Jääskeläinen, 2009]:
- with increasing translation competence, translators focus on largertranslation units[Koehn, 2010]:
- in (phrase-based) MT, large training data sets make it possible tolearn longer phrases
experienced translators take into account more aspects, e.g. textfunction, context, registersystems trained with larger data set can also capture morecontextual information
January 29th, 2015 “The More, the Better” 16
Methods and Data
Methods and Data
January 29th, 2015 “The More, the Better” 17
Methods and Data
Questions and Methodology
compare relations between novice translations and statisticalmachine translations with small databetween professional translations and statistical machinetranslations produced with much datatrace the patterns specific for both human and machinetranslation:⇒ define similarities and differences between the varieties
January 29th, 2015 “The More, the Better” 18
Methods and Data
Corpus-Based Approach
REQUIREMENT CHOICE
lexico-grammatical corpus-based approachfeaturescorpus contain human and machine transla-
tionsetting 1 two variants of HTsetting 2 two SMT systemsfeatures frequencies of lexico-grammatical
patternsfeature distributions across translation varieties
January 29th, 2015 “The More, the Better” 19
Methods and Data
Data
January 29th, 2015 “The More, the Better” 20
Methods and Data Data
Corpus
VARTRA cf. Lapshinova (2013)
contains:
variants of translation from English into German= translation varieties produced by:(1) human professional translators (PT1)(2) human inexperienced translators (PT2)(3) a rule-based MT system (RBMT)(4) 2 statistical MT systems (SMT1 and SMT2)
TOTAL number of tokens in translations ca. 600,000
January 29th, 2015 “The More, the Better” 21
Methods and Data Data
Corpus
PT1 – CroCo, [Hansen-Schirra et al., 2012]PT2 – trained translators (over BA) with no/little experienceRBMT – SYSTRANSMT1 – Google Translate (big undefined data)SMT2 – Moses system (small known data)
Each translationcovers 7 registers:
political essays – ESSAYfictional texts– FICTIONinstruction manuals– INSTRpopular-scientific articles– POPSCIletters of share-holders– SHAREprepared political speeches– SPEECHtouristic leaflets – TOU
January 29th, 2015 “The More, the Better” 22
Methods and Data Feature Selection
Features: Patterns
January 29th, 2015 “The More, the Better” 23
Methods and Data Feature Selection
Feature Requirements
reflect linguistic characteristics of all texts under analysis
content-independent (do not contain terminology or keywords)
easy to interpret
January 29th, 2015 “The More, the Better” 24
Methods and Data Feature Selection
Feature Requirements
reflect linguistic characteristics of all texts under analysis
content-independent (do not contain terminology or keywords)
easy to interpret
January 29th, 2015 “The More, the Better” 24
Methods and Data Feature Selection
Feature Requirements
reflect linguistic characteristics of all texts under analysis
content-independent (do not contain terminology or keywords)
easy to interpret
January 29th, 2015 “The More, the Better” 24
Methods and Data Feature Selection
Nature of Features
Data-driven
Theory-driven
January 29th, 2015 “The More, the Better” 25
Methods and Data Feature Selection
Nature of Features
Data-driven Theory-driven
January 29th, 2015 “The More, the Better” 25
Methods and Data Feature Selection
Our Features
1 data-driven:- joint work with Marcos Zampieri (DFKI, Saarbrücken)
2 theory-driven:- in [Lapshinova, forthcoming],- in joint work with Mihaela Vela (UdS)
January 29th, 2015 “The More, the Better” 26
Methods and Data Feature Selection
Theory-Driven Features
patterns register translationese1 content vs. grammatical words mode simplification2 nominal vs. verbal word classes and
phrasesfield normalisation / shining
through3 ung-nominalisation field normalisation / shining
through4 nominal vs. pronominal and demon-
strative vs. personalmode explicitation, normalisati-
on / shining through5 abstract or general nouns vs. all other
nounsfiels explicitation
6 logico-semantic relations: additive,adversative, causal, temporal, modal
mode explicitation
7 modal meanings: obligation, permis-sion, volition
tenor normalisation / shiningthrough
8 evaluative patterns tenor normalisation / shiningthrough
January 29th, 2015 “The More, the Better” 27
Analysis and Interpretations
Analysis and Interpretation
January 29th, 2015 “The More, the Better” 28
Analysis and Interpretations
Hierarchical Cluster Analysis (HCA)
� discover differences and similarities between subcorpora� discover ‘interesting structures’� group according to different dimensions, i.e. variety and register
cf. Baayen (2008); Everitt et al. (2001)set of dissimilarities for the n objects is usedeach subcorpus is assigned to its own clusteriteratively, two most similar clustersstatistics behind hierarchical cluster analysis quantifies how farapart (or similar) two (sets of) subcorpora areon the top node, all clusters are joined together
⇒ subcorpora similar to each other have a common node on the tree
January 29th, 2015 “The More, the Better” 29
Analysis and Interpretations
Hierarchical Cluster Analysis (HCA)
January 29th, 2015 “The More, the Better” 30
Analysis and Interpretations
Statistical test: Correspondance Analysis
Input: frequencies of featuresOutput: a two dimensional graph with:
arrows for the observed feature frequenciespoints for translation varieties
Interpretation:the length of the arrows indicates how pronounced a particularfeature isthe position of the points in relation to the arrows indicates therelative importance of a feature for a register.the arrows pointing in the direction of an axis indicate a highcontribution to the respective dimension
cf. [Baayen, 2008], [Glynn, 2012]
January 29th, 2015 “The More, the Better” 31
Analysis and Interpretations
Correspondence Analysis
January 29th, 2015 “The More, the Better” 32
Analysis and Interpretations
Summary and Discussion
parallelism in HT and MT in terms of experinceinfluence on quality?realise that both differ in the type of experience (or data) acquired⇒ need to be modelled differentlymore datamore featuresother methods?
January 29th, 2015 “The More, the Better” 33
Analysis and Interpretations
Summary and Discussion
January 29th, 2015 “The More, the Better” 34
Thank you!Questions? Comments? Suggestions?
January 29th, 2015 “The More, the Better” 35
Babych, B., Hartley, A., and Sharoff, S. (2004).Modelling legitimate translation variation for automatic evaluation of mt quality.In Proceedings of LREC-2004, volume Vol. 3.
Bertoldi, N. and Federico, M. (2009).Domain adaptation for statistical machine translation with monolingual resources.In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT ’09, pages 182–189, Stroudsburg, PA,USA. Association for Computational Linguistics.
Carl, M. and Buch-Kromann, M. (2010).Correlating translation product and translation process data of professional and student translators.In Annual Conference of the European Association for Machine Translation, Saint-Raphaël, France.
El-Haj, M., Rayson, P., and Hall, D. (2014).Language independent evaluation of translation style and consistency: Comparing human and machine translations ofcamus’ novel “the stranger”.
Englund-Dimitrova, B. (2005).Expertise and Explicitation in the Translation Process.John Benjamins, Amsterdam/Philadelphia.
Estrella, P. S., Hamon, O., and Popescu-Belis, A. (2007).How Much Data is Needed for Reliable MT Evaluation? Using Bootstrapping to Study Human and Automatic Metrics.In Proceedings of Machine Translation Summit XI, pages 167–174.ID: unige:3461.
Fishel, M., Sennrich, R., Popovic, M., and Bojar, O. (2012).Terrorcat: a translation error categorization-based mt quality metric.In 7th Workshop on Statistical Machine Translation.
Gellerstam, M. (1986).Translationese in Swedish novels translated from English.In Wollin, L. and Lindquist, H., editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund.
Hansen-Schirra, S., Neumann, S., and Steiner, E. (2012).Cross-linguistic Corpora for the Study of Translations. Insights from the Language Pair English-German.de Gruyter, Berlin, New York.
Jääskeläinen, R. (1997).Tapping the Process: An Explorative Study of the Cognitive and Affective Factors Involved in Translating. Doctoraldissertation.PhD thesis, University of Joensuu, Joensuu.
Jakobsen, A. L. (2002).Translation drafting by professional translators and by translation students.In Hansen, G., editor, Empirical Translation Studies. Process and Product, volume 27 of Copenhagen Studies inLanguage, pages 191–204. Samfundslitteratur, Frederiksberg.
Koehn, P. (2010).Statistical Machine Translation.Cambridge University Press, New York, NY, USA, 1st edition.
Koehn, P. and Haddow, B. (2012).Towards effective use of training data in statistical machine translation.In Proceedings of the Seventh Workshop on Statistical Machine Translation, WMT ’12, pages 317–321, Stroudsburg, PA,USA. Association for Computational Linguistics.
Martinez Melis, N. and Hurtado Albir, A. (2001).Assessment in translation studies: Research needs.journal des traducteurs / Translators’ Journal, 46(2):272–287.
Muñoz Martín, R. and Conde, T. (2007).Effects of serial translation evaluation.In Schmitt, P. A. and Jüngst, H. E., editors, Translationsqualität, pages 428–444. Peter Lang, Frankfurt.
Papineni, K., Roukus, S., Ward, T., and Zhu, W.-J. (2002).BLEU: a method for automatic evaluation of machine translation.
In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318.
Popovic, M. and Burchardt, A. (2011).From human to automatic error classification for machine translation output.In 15th International Conference of the European Association for Machine Translation (EAMT-2011), Leuven, Belgium.European Association for Machine Translation.
Popovic, M. and Ney, H. (2011).Towards automatic error analysis of machine translation output.Computational Linguistics, 37(4):657–688.
Volansky, V., Ordan, N., and Wintner, S. (2011).More human or more translated? original texts vs. human and machine translations.In Proceedings of the 11th Bar-Ilan Symposium on the Foundations of AI With ISCOL (Israeli Seminar on ComputationalLinguistics).
White, J. S. (1994).The ARPA MT evaluation methodologies: Evolution, lessons, and further approaches.In Proceedings of the 1994 Conference of the Association for Machine Translation in the Americas, pages 193–205.
January 29th, 2015 “The More, the Better” 35