Catalan daily goes Catalan
LocWord 2012, A4
Magí Camps (La Vanguardia)
Blanca Vidal (Lucy Software)
[1] Introduction, background
Newspapers in Catalan: net circulation (bar chart; values shown: 79,239, 45,309, 31,762, 15,662 and 6,779)
Source: Estudi General de Mitjans (EGM), 2012
Results
• Increase: +4% of copies, +7% of readers
• Distribution: 57% Spanish, 43% Catalan
Why a Catalan version?
• Celebration of LV’s 130th anniversary
• Normalization of the use of Catalan
• Investment to face the crisis
• Opportunity to consolidate LV’s hegemony
[2] Customer goals
To publish two language editions of the same newspaper daily (supplements incl.).
Journalists should be able to write in either of the two languages.
Neither quality nor distribution timeframes should be affected.
Customer requirements (MT)
• Tailor-made system
• Complying with LV’s style guide
• Seamless integration into the journalists’ workflow
• Translation of Hermes XML and InDesign formats
• Reliability, high availability
• High performance
[3] Ramp-up phase
Project set-up
Work areas
• MT linguistic improvement/tuning
• Post-editing preparation
• MT system set-up and integration
• MT lexicon training
Duration
• 8 months (+ 3 months)
Staff
• LV: 10-12 in-house journalists
• Lucy: 3 computational linguists / lexicographers, 1 software developer
• Incyta: 2 professional post-editors
Important! On-site support
Subphases
TASKS Phase 1 Phase 2 Phase 3 Phase 4
Linguistic improvement/tuning
- Language-type definition x
- Creation of a corpus of real texts x x x x
- Analysis of the translation quality x x x x
- Error reporting (lexicon and grammar errors) x x x x
- Linguistic implementation (lex and grammar) x x x x
- Pre and post-editing filters x x x x
Post-editing preparation
- Gathering of MT post-editing guidelines x
- Evaluation of post-editing effort x x
- Creation and training of the post-editing team x
Technical set-up
- System set-up and integration x
- Preparation of XML converters x
Maintenance
- Lexicon maintenance training x
Duration 2 mo 3 mo 3 mo 3 mo
[a] Linguistic tuning
Language model
Corpus
Translation quality (TQ)
Analysis and error-reporting
Implementation
Accomplished improvement data
Catalan language model
• No exclusion
• Compliant with standards
• Innovative in terminology
• Dynamic in syntactical structures
Corpus
• ES: 500,000 transl. units – 8,300,000 words
• CA: 250,000 transl. units – 3,000,000 words
Conclusions
• No specific domains (except Sports)
• Culture: proper names
• Opinion: idioms, plays on words
• Errors not repetitive
• % style to be post-edited
Translation quality
• Perfect: 74%
• Minimal post-editing: 24%
• Medium post-editing: 2%
Analysis and error reporting
• Semi-automatic detection of missing words
• Terminology lists
• New and different translations, error reporting
Implementation
• Proper names [44.5% of the TUs]
• Idioms
• Alternatives
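The semi-automatic detection of missing words can be pictured with a short sketch. The slides do not describe the actual tooling, so everything here is an assumption for illustration: the function name `find_missing_words`, the `min_count` threshold, and the toy corpus and lexicon. The idea is simply to count corpus words that the MT lexicon does not know and to hand the most frequent ones to a lexicographer for review.

```python
import re
from collections import Counter


def find_missing_words(corpus_lines, lexicon, min_count=2):
    """Return (word, frequency) pairs for corpus words absent from the lexicon.

    Hypothetical helper: the real detection process is not documented in the slides.
    """
    counts = Counter()
    for line in corpus_lines:
        # In Python 3, \w matches Unicode letters, so "ç" or "à" stay inside the token.
        for token in re.findall(r"\w+", line.lower()):
            if not token.isdigit():
                counts[token] += 1
    known = {word.lower() for word in lexicon}
    missing = [(w, n) for w, n in counts.items() if w not in known and n >= min_count]
    # Most frequent first, ties in alphabetical order, so reviewers see the biggest gaps first.
    return sorted(missing, key=lambda item: (-item[1], item[0]))


# Toy data, invented for the example.
corpus = ["El Barça gana la Liga", "La Liga empieza hoy", "El Barça juega hoy"]
lexicon = ["el", "la", "gana", "empieza", "hoy", "juega"]
candidates = find_missing_words(corpus, lexicon)
print(candidates)
```

The output is only a candidate list: a person still decides whether each word is a proper name, a new term, or a typo, which is what keeps the process semi-automatic.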
Accomplished improvement data
• Work in figures
  - 40,000 lexicon entries (20,000 for each transl. direction)
  - Around 440 grammar rules
  - Around 7,200 words in the proper names files (each transl. direction)
• Non-measurable work
  - Understanding of the MT system
  - Understanding of the newspaper specificities
  - Support in the style guide taking into account MT
• Improvement
  - ES>CA: 41% diff => 35% better, 4% similar, 2% worse
  - CA>ES: 36% diff => 32% better, 3% similar, 1% worse
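The improvement figures can be read more easily with a little arithmetic. The sketch below assumes that "diff" means the share of translations that changed after tuning, and that the better/similar/worse figures are percentages of the whole test set; the slides do not spell this out. The helper name `improvement_summary` and the two derived measures are illustrative, not part of the original evaluation.

```python
def improvement_summary(changed, better, similar, worse):
    """Derive two summary figures; all arguments are percentages of the test set."""
    # The three outcome classes should add up to the share of changed translations.
    if better + similar + worse != changed:
        raise ValueError("outcome classes do not add up to the changed share")
    return {
        # Percentage points of the test set gained once regressions are subtracted.
        "net_gain": better - worse,
        # Share of the changed translations that were judged better.
        "better_among_changed": round(100 * better / changed, 1),
    }


es_ca = improvement_summary(changed=41, better=35, similar=4, worse=2)
ca_es = improvement_summary(changed=36, better=32, similar=3, worse=1)
print(es_ca)  # {'net_gain': 33, 'better_among_changed': 85.4}
print(ca_es)  # {'net_gain': 31, 'better_among_changed': 88.9}
```

Under these assumptions, roughly 85% to 89% of the translations that changed during tuning changed for the better.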
[b] Post-editing
Metrics on translation volume
Metrics on post-editing effort
Specificities of the text
Post-editors’ workspace
Post-editing resources
Error reporting process and tools
Post-editing team and profile
Post-editing: metrics
File            Total translation units   Lex/gram post-editing   %        Style post-editing   %
LV_2010-10-27   2,474                     464                     18.79%   394                  15.96%
(= 42,512 words)
Conclusions
• Different sections had different levels of post-editing
• What style corrections could be avoided?
• Post-editing speed: 1,000-1,500 words/h
• Daily volume: 75,000 words
• New post-editing team: 20 post-editors/12 editors
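The speed and volume figures above allow a rough workload estimate. This is only a sketch: it assumes that the whole daily volume is post-edited, that the work is split evenly, and that all 20 post-editors work in parallel, none of which the slides state. The function name `post_editing_hours` is made up for the example.

```python
def post_editing_hours(daily_words, words_per_hour, team_size):
    """Return (total person-hours, hours per post-editor) for one day's volume."""
    total_hours = daily_words / words_per_hour
    return total_hours, total_hours / team_size


# Figures from the slide: 75,000 words/day, 1,000-1,500 words/h, 20 post-editors.
slow = post_editing_hours(75_000, 1_000, 20)
fast = post_editing_hours(75_000, 1_500, 20)
print(slow)  # (75.0, 3.75)
print(fast)  # (50.0, 2.5)
```

Under these assumptions the daily edition needs between 50 and 75 person-hours of post-editing, or about 2.5 to 3.75 hours per post-editor.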
Post-editing: resources, workspace
Post-editors should be proficient in their own skills BUT should also:
• Be trained in MT post-editing: post-editing guide
  - Classified frequent MT errors
  - Reference document for training
• Have an integrated workspace: adapt the CMS to the new workflow
  - New processing status
  - New mark-ups
• Have resources a click away: resources on the intranet language portal
  - Bilingual style guide
  - Links to all reference dictionaries
  - MT portal for any journalist
La Vanguardia’s intranet: linguistic portal
Post-editing: error reporting, team
Error reporting
• Crucial for continuous improvement
• Not automated (yet)
• Provide better support for error reporting
Definition of post-editing profile and team
• Proficient in Catalan
• Journalist background
[c] System integration
During phase 1: pre-production
• Pre-production set-up and installation
• Hermes XML converter
• Changes in the LT engine to translate InDesign files
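To give an idea of what an XML converter has to do, here is a minimal sketch that translates the text of an XML document while leaving tags and attributes untouched. It is not the Hermes converter: the Hermes XML schema is not shown in the slides, so the element names (`article`, `headline`, `body`, `b`) are invented, and `toy_translate` is a stand-in word table where the real system would call the MT engine.

```python
import xml.etree.ElementTree as ET


def translate_xml(xml_text, translate):
    """Translate text nodes; leave tags and attributes as they are."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # element.text is the text before the first child element;
        # element.tail is the text that follows the element's closing tag.
        if element.text and element.text.strip():
            element.text = translate(element.text)
        if element.tail and element.tail.strip():
            element.tail = translate(element.tail)
    return ET.tostring(root, encoding="unicode")


# Stand-in for the MT call: a tiny ES>CA word table, purely for illustration.
TOY_TABLE = {"Buenos": "Bon", "días": "dia"}


def toy_translate(text):
    # Splitting on single spaces keeps leading and trailing spaces intact.
    return " ".join(TOY_TABLE.get(word, word) for word in text.split(" "))


source = '<article id="1"><headline>Buenos días</headline><body>Buenos <b>días</b></body></article>'
result = translate_xml(source, toy_translate)
print(result)
```

Note that this sketch translates each text fragment on its own, so a sentence interrupted by inline markup (as in the `body` element) loses its context. A production converter would have to send whole sentences to the engine and put the mark-up back afterwards.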
During phase 3: production
• Production installation
• Test (load, performance and stress)
• Performance: 500-1,200 w/sec
• Definition of the final installation size
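As a quick sanity check on the performance figure, the sketch below estimates how long raw machine translation of one day's volume would take. It uses the 75,000 words/day quoted in the post-editing metrics and assumes a single, sustained stream at the measured throughput, which is an idealisation; the function name `translation_seconds` is illustrative.

```python
def translation_seconds(words, words_per_second):
    """Time needed to machine-translate a given volume at a sustained throughput."""
    return words / words_per_second


DAILY_WORDS = 75_000  # daily volume quoted in the post-editing metrics

slowest = translation_seconds(DAILY_WORDS, 500)
fastest = translation_seconds(DAILY_WORDS, 1_200)
print(slowest, fastest)  # 150.0 62.5
```

At these rates the engine itself needs only one to two and a half minutes per day, so the time budget is dominated by post-editing and review rather than by MT.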
• Production: balanced high performance (HP) and high availability (HA) configuration
• System requirements: normal Windows Server -> low HW footprint (e.g. Dual Core/Quad 2.5-3 GHz, 2-4 GB RAM running Win Server 2003/2008)
[Architecture diagram. Labels: Maintenance / Pre-production (Hermes, InDesign, Web Service); Language portal; Production (InDesign, Hermes, Web Service)]
[4] Operation: production process
Staff
• 20 post-editors
• 12 editors
Effort
• 30’ linguistic review
• 10’ journalistic review
• 70,000 words/day + suppl.
Timeline
• Start: 5 p.m.
• First edition: 11.30 p.m.
• Second edition: 2.30 a.m.
Challenge accomplished!
[5] Next goals
Success! Yes. Thanks to
• Close work and cooperation
• Three parties involved
• Time and effort investment
• Customisation
Next!
• How to reduce post-editing effort
• How to re-use post-edited text
Thank you for your attention
Magí Camps, La Vanguardia
Blanca Vidal, Lucy Software
Ignasi