1 Elliott Macklovitch Université de Montréal, Canada LREC 2006 – Genoa, Italy TransType2 : The Last Word

1

Elliott Macklovitch

Université de Montréal, Canada

LREC 2006 – Genoa, Italy

TransType2 :

The Last Word

2

What is TransType?

• a novel kind of interactive MT, in which
  – the user and the system collaborate to draft a target translation (vs. SL disambiguation)
  – system’s contributions are completions to the prefix typed by the user (generated by SMT)
  – the user is in control of the translation process, i.e. can always ignore system’s predictions
  – the system must adapt its predictions to each new character entered by the user
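
The interaction loop described above can be illustrated with a minimal sketch. It assumes the engine has already produced a ranked list of candidate target sentences; in TransType the completions are generated by SMT, so the fixed list, the function name `complete` and the example sentences are stand-ins invented for illustration only.

```python
from typing import List, Optional


def complete(prefix: str, candidates: List[str]) -> Optional[str]:
    """Return the text that would extend `prefix`, or None.

    `candidates` is assumed to be sorted best-first; the first one
    that is compatible with what the user has typed wins.
    """
    for cand in candidates:
        if cand.startswith(prefix) and len(cand) > len(prefix):
            return cand[len(prefix):]
    return None


# Hypothetical ranked hypotheses for one source sentence.
candidates = [
    "Cliquez sur le bouton Démarrer.",
    "Cliquez sur la touche Démarrer.",
    "Appuyez sur le bouton Démarrer.",
]

# The prediction is recomputed after every character the user enters.
typed = ""
for ch in "Cliquez sur la":
    typed += ch
    suggestion = complete(typed, candidates)
    print(repr(typed), "->", repr(suggestion))
```

The user can ignore every suggestion and simply keep typing; once the prefix diverges from the top-ranked candidate, the sketch falls back to the next compatible one.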

3

What was TransType2?

• an international research project (2002 - 2005), involving:
  – 3 university research labs: RWTH (Germany), ITI (Spain), RALI (Canada)
  – 2 industrial partners: XRCE (France) & Atos Origin (Spain)
  – 2 translation firms, representing end-users: Société Gamma (Canada) & Celer Soluciones (Spain)
• funded by EC’s FP5 in Europe; federal & Quebec governments in Canada
• applied research: ultimate aim was to provide a practical solution to the growing need for HQ translation

4

5

Target-text mediated IMT

• an intriguing idea… but will it work?
  – it should (in theory), because each accepted completion reduces the number of keystrokes
  – but the user has to evaluate the proposed completions, and this takes time…
• hence the need for user trials, involving real translators
  – TT2 included quarterly trials at the two translation firms, from month 18 until month 36

6

Two types of evaluation in TT2

• internal technical evaluations
  – employ automatic metrics, e.g. BLEU, WER
• usability evaluations (5 rounds)
  – measure TT’s impact on users’ productivity
  – ease (or difficulty) with which end-users adapt to the system
  – channel for feedback to developers
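
As an illustration of the automatic metrics mentioned for the internal evaluations, here is a minimal sketch of word error rate (WER): the word-level edit distance between a system output and a reference translation, divided by the reference length. The whitespace tokenisation and the example sentences are assumptions made for the sketch, not the project's actual evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one inserted word ("now"): 2 edits / 4 words.
print(word_error_rate("press the start button", "press start button now"))
```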

7

Protocol for in-situ user trials

• corpus: 1 million words of Xerox manuals
  – available in project’s four languages
  – partitioned into training, development, test
  – Xerox terminology glossary; PDF original
• 3 translators (TRs) at each agency; Eng. >> Fr. & Sp.
  – 10 consecutive half-day working sessions
  – 1st devoted to training, 2nd to ‘dry-run’
  – baseline comparison: translating within TT2 editor, but with prediction engine off

8

Protocol (cont’d)

• Quality assurance: all translations reviewed by a non-participating reviser (ER4)
  – principally, for errors of form
  – productivity gains not at the expense of quality
• use of TT-Player:
  – reads a detailed trace file that records and times every interaction between user & system
  – can play back the session, like a VCR
  – generates detailed statistics
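
To make the idea of trace-based statistics concrete, the sketch below computes a few figures from a timed event log. The actual TT-Player trace format is not described in these slides; the `Event` record, its two event kinds and the sample session are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    t: float    # seconds since the start of the session (assumed field)
    kind: str   # "key" = typed text, "accept" = accepted completion (assumed)
    text: str   # characters added to the target text by this event


def session_stats(events: List[Event]) -> Dict[str, float]:
    """Summarise one session: output size, effort and speed."""
    target = "".join(e.text for e in events)
    words = len(target.split())
    typed = sum(len(e.text) for e in events if e.kind == "key")
    completed = sum(len(e.text) for e in events if e.kind == "accept")
    seconds = events[-1].t - events[0].t if len(events) > 1 else 0.0
    return {
        "words": words,
        "typed_chars": typed,
        "completed_chars": completed,
        "words_per_hour": words * 3600 / seconds if seconds else 0.0,
    }


# Hypothetical trace: two keystrokes and two accepted completions.
trace = [
    Event(0.0, "key", "C"),
    Event(2.0, "accept", "liquez "),
    Event(4.0, "key", "s"),
    Event(6.0, "accept", "ur"),
]
print(session_stats(trace))
```

Keeping the typed and completed character counts separate is what would allow an effort comparison against a session with the prediction engine off.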

9

TT-Player (in replay mode)

10

Productivity results

C-TR1 C-TR2 C-TR3 G-TR4 G-TR5 G-TR6

ER3: dry-run (words/hour) 984 432 864 786

ER3: average on 3-4 texts (w/h) 918 774 882 576

ER4: dry-run (w/h) 781 1030 772 518 1081 825

ER4: average on 8 texts (w/h) 1017 1410 725 707 1531 1279

% increase in productivity (ER4 average vs. dry-run) +30.22 +36.89 -6.08 +36.48 +41.62 +55.03

ER5: dry-run1 (w/h) 924 858 654 864 1338

% increase in productivity (ER5 vs. dry-run1) +14.29 +28.67 +12.54 +27.78 -20.63

ER5: dry-run2 (w/h) 1602 1416 816 1548 1350

% increase in productivity (ER5 vs. dry-run2) -34.08 -22.03 -9.80 -28.68 -21.33

ER5: average on 2 dry-runs (w/h) 1290 1137 735 1206 1344

% increase in productivity (ER5 vs. average of 2 dry-runs) -18.1 -2.9 0.0 -8.4 -20.9
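
The percentage rows can be reproduced from the words-per-hour rows as a relative change over the dry-run baseline. The sketch below applies this to the ER4 figures; the function name is illustrative. Recomputing from the rounded rates shown on the slide agrees with the listed percentages to within about 0.01, which suggests (as an assumption) that the originals were computed from unrounded rates.

```python
def pct_increase(baseline: float, with_completions: float) -> float:
    """Relative change of `with_completions` over `baseline`, in percent."""
    return 100.0 * (with_completions - baseline) / baseline


# ER4 rows, in the column order C-TR1 ... G-TR6.
er4_dry_run = [781, 1030, 772, 518, 1081, 825]
er4_average = [1017, 1410, 725, 707, 1531, 1279]
er4_listed = [30.22, 36.89, -6.08, 36.48, 41.62, 55.03]

for dr, avg, listed in zip(er4_dry_run, er4_average, er4_listed):
    print(f"{pct_increase(dr, avg):+.2f}  (slide: {listed:+.2f})")
```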

11

Results of ER 3 & 4

• ER3: three of four participants exceeded DR productivity on at least one text
• ER4: five of six TRs exceeded their DR rate on 7/8 texts translated with completions
  – increases quite substantial, from 30% to 55%
  – concomitant reduction in effort: target text produced with half the number of keystrokes & mouse clicks
• revisers found no more errors in texts produced with TT2 than in DR texts
  – hence, gains not achieved at expense of quality!

12

Problems with ER4 protocol

• scheduling dry-run as 1st session in round
  – as trial progresses, a gradual improvement in TR productivity can be observed (‘learning curve effect’)
  – dry-run first may unduly favour the system
• high degree of full-sentence overlap between test corpus and training corpus (41%)
  – no error or oversight in selecting test corpus; rather, a characteristic of this kind of manual
  – nevertheless, we decided to reanalyze the trace files, separating repeated from non-repeated sentences and calculating new statistics for each
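
The separation of repeated from non-repeated sentences can be sketched as a set lookup against the training corpus. The sketch assumes that "full-sentence overlap" means an exact match after whitespace normalisation; the matching criterion actually used in the project is not stated here, and the sample sentences are invented.

```python
from typing import Iterable, List, Tuple


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so trivially different copies match."""
    return " ".join(sentence.split())


def split_by_repetition(
    test_sentences: Iterable[str], training_sentences: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Partition test sentences into (repeated, novel) w.r.t. training."""
    seen = {normalize(s) for s in training_sentences}
    repeated: List[str] = []
    novel: List[str] = []
    for s in test_sentences:
        (repeated if normalize(s) in seen else novel).append(s)
    return repeated, novel


def overlap_rate(test_sentences: List[str], training_sentences: List[str]) -> float:
    """Fraction of test sentences that also occur in the training corpus."""
    repeated, _ = split_by_repetition(test_sentences, training_sentences)
    return len(repeated) / len(test_sentences) if test_sentences else 0.0


# Invented miniature corpora.
train = ["Press Start.", "Load paper in tray 1."]
test = ["Press  Start.", "Open the front door.",
        "Load paper in tray 1.", "Close the cover."]
print(overlap_rate(test, train))
```

Statistics such as words per hour can then be computed separately over the two partitions.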

13

Repetitions in ER4 test corpus

• general correlation between TR productivity and level of full-sentence repetition
  – counting only novel sentences, increase in the average productivity of 6 TRs was ~20% over their dry-run productivity
  – including repeated sentences, overall increase in productivity was about 32%
  – the fact that TT can handle external repetitions correctly is definitely a plus

14

Protocol for ER5

• test corpus drawn from new Xerox manuals
  – of a type similar to those used for ER4
  – verified that test corpus contained no repeated sentences wrt. training corpus
• 2nd dry-run session added at end of round
  – to counter the argument that a single dry-run in the first session unduly favoured the system

15

ER5 Productivity results




ER4: dry-run (w/h) 781 1030 772 518 1081 825

ER5: dry-run1 (w/h) 924 858 654 864 1338

ER5: dry-run2 (w/h) 1602 1416 816 1548 1350




17

Results of ER5 (cont’d.)

• ER5 productivity compared to 2 dry-runs:
  – average productivity of 4/5 participants > DR1
  – but productivity on DR2 very high
  – using TT’s predictions, only 1/5 participants surpassed combined DR1+DR2 productivity
• text selected for DR2 particular in having:
  – very short average sentence length & highest rate of internal repetition
  – significantly easier to translate than other chapters

18

ER5 – Productivity per text

[Chart: words per minute (axis from 0 to 35) for each text, one series per translator: C-TR1, C-TR2, C-TR3, G-TR4, G-TR5, G-TR6]

19

Non-quantitative trial results

• validated the general evaluation approach
  – for a CAT tool, production time remains the best measure of the system’s assistance
  – in-situ trials that replicate normal working conditions are indispensable
  – reliance on trace file for accurate measurements and honest indication of users’ preferences
• Lessons for evaluation methodology
  – need to take ‘learning curve effect’ into account
  – need to assess difficulty of test texts

20

Users’ attitude to TT2

• concerted effort made to gather and analyse users’ comments & suggestions
  – pop-up notepad added to TT2 GUI
• users resented having to make the same modifications to repeated sentences
  – need to add full-S repetitions processing (TM)
• more generally: “Why can’t the system learn from my corrections?”
  – on-line adaptive learning represents a difficult research challenge

21

Conclusions

• Target-text mediated IMT is a novel approach that has much to recommend it:
  – when engines perform well, users appreciate the productivity gains it affords and the full control of translation quality that it gives them
• Hopefully, TT2 will not be the last word
  – what needs to be done to improve the system’s acceptance by professional TRs is quite clear
  – as demand for HQ translation soars, there continues to be a real need for new tools to assist TRs and make them more productive

22

For more information on TransType:

• Visit our Web site (on-line demo):

http://rali.iro.umontreal.ca

• Contact me directly:

[email protected]
