Page 1: EMMA Summer School - Jorge Civera - Being multilingual with EMMA

Being multilingual with EMMA

Jorge Civera

EMMA Summer School

[email protected]

Tuesday 7th July, 2015

Index

1. Presentation

2. Multilingual access to MOOCs

3. Video subtitling

• Transcription

• Translation

4. Document translation

5. Conclusions and Discussion

UPV - Being multilingual with EMMA 2 / 21

Presentation

• Lecturer at the Department of Computer Systems and Computation

• Machine Learning and Language Processing (MLLP) group (mllp.upv.es)

• Automatic Speech Recognition:

– Already supported: English (En), Spanish (Es), Italian (It), Dutch (Nl), Estonian (Et), Portuguese (Pt), French (Fr) and Catalan (Ca)

– In progress: German (De) and Slovene (Sl)

• Machine Translation:

– Language pairs available: En→ {Es, It, Fr, Ca} and {Es, It, Nl, Et, Pt, Fr, Ca}→ En

• Speech Synthesis:

– Already supported: English (En) and Spanish (Es)

• Experience in EU projects providing multilingual access to educational content:

– transLectures and EMMA

Presentation

• transLectures (Nov 2011 - Oct 2014)

– Lowering the language barrier to accessing video repositories by providing multilingual subtitles

– Improving subtitles by massive adaptation and intelligent interaction

– VideoLectures.NET (VL) and poliMedia (pM) video repositories with thousands of hours

– Source languages: English and Slovene in VL and Spanish in pM

– Target languages: Spanish, French, German, Slovene and English

• EMMA (Feb 2014 - Jul 2016)

– Providing multilingual access to MOOCs (videos and documents)

– A few hours of video in 7 languages: En, Es, It, Nl, Et, Pt and Fr

– Source language is the national language of the MOOC provider

– Target languages: English, Spanish and Italian

Multilingual access to MOOCs

• Most MOOCs are offered in only a few languages

– English (45%), Spanish (32%), French (14%) and other languages (9%)

• Language barrier is keeping millions of potential learners from taking MOOCs

• What components in a MOOC need to be translated?

– Texts

– Images

– Videos

– Conversations (Forums)

• EMMA currently tackles the translation of texts and videos

• Videos are translated by providing subtitles in the target language

Cost of translating MOOCs

Texts

• Manual translation rate is approximately 2,500 words per day

• A 6-week course with 75,000 words takes 1.5 person-months (PM) to translate

Videos

• Before translating, videos are manually transcribed (10 RTF)

• Then, transcriptions are translated into the desired language (30 RTF)

• A course including 2 hours of video takes 0.5 PM to be translated

Solutions to lower costs

• Crowdsourcing (TED talks)

• Speech Recognition and Machine Translation to generate draft translations

– User effort to translate a course is reduced to 30% - 50% (0.6 - 1 PM)
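The figures on this slide can be reproduced with a short calculation. The sketch below is only an illustration: the 20 working days per month and 8 working hours per day are assumptions (the slides do not state them), chosen because they make the quoted 1.5 PM and 0.5 PM come out exactly. The function names are hypothetical.

```python
WORDS_PER_DAY = 2_500        # manual translation rate quoted on the slide
WORKING_DAYS_PER_MONTH = 20  # assumption, not stated on the slide
HOURS_PER_DAY = 8            # assumption, not stated on the slide


def text_person_months(words: int) -> float:
    """Person-months needed to translate `words` words by hand."""
    return words / WORDS_PER_DAY / WORKING_DAYS_PER_MONTH


def video_person_months(video_hours: float,
                        transcription_rtf: float = 10,
                        translation_rtf: float = 30) -> float:
    """Person-months needed to transcribe and then translate the videos.

    An RTF of 10 means 10 hours of work per hour of video.
    """
    work_hours = video_hours * (transcription_rtf + translation_rtf)
    return work_hours / (HOURS_PER_DAY * WORKING_DAYS_PER_MONTH)


text_pm = text_person_months(75_000)   # 6-week course with 75,000 words
video_pm = video_person_months(2)      # 2 hours of video
total_pm = text_pm + video_pm

# Reviewing automatic drafts is quoted as 30% - 50% of the manual effort.
assisted_low_pm = 0.3 * total_pm
assisted_high_pm = 0.5 * total_pm

print(f"text: {text_pm} PM, video: {video_pm} PM, total: {total_pm} PM")
print(f"with automatic drafts: {assisted_low_pm:.1f} - {assisted_high_pm:.1f} PM")
```

Under these assumptions the manual total is 2 PM, which matches the figure given in the conclusions.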

Overview of automatic video subtitling

• Step-by-step process:

1. Generation of automatic transcriptions from video

2. Manual review of automatic transcriptions to correct transcription errors

3. Generation of automatic translations from manually reviewed transcription

4. Manual review of automatic translations to generate final subtitles

• State-of-the-art technology cannot provide perfect automatic subtitles

• However, it significantly reduces the effort to generate multilingual subtitles

• User effort saving depends on automatic transcription+translation accuracy

• You can help improve transcription and translation accuracy

How to improve transcription accuracy

• Transcription systems learn to transcribe from examples

– At least 50 hours of videos (audio) previously transcribed to learn the acoustic model

– Texts in millions of words to learn the language model

Language      Videos (hours)   Text (Mwords)
Dutch                    532             628
English                  620          464000
Estonian                 130             410
French                    88            1800
German                    36             135
Portuguese                54             573
Italian                   54             868
Slovene                   27             224
Spanish                  128             654

• Adaptation of transcription systems to the specific videos is key for high accuracy

– Availability of videos manually transcribed with similar acoustic conditions

– Availability of text resources related to the video in question

∗ Title is used to retrieve related documents from Google

∗ Slides contain most of the words uttered by the lecturer

∗ Documents: text content from the course, additional text resources (bibliography)

Why automatic transcriptions

• Quality of automatic transcription can be impressive, but it greatly depends on:

– Availability of transcribed videos and related text materials

– Sound quality of the video

– Complexity of language involved (phonetics and grammar)

• All in all, high-accuracy fully automatic transcription is not possible

• Automatic transcriptions need to be manually reviewed

• Reviewing automatic transcription is much faster than doing it from scratch

• Transcriptions are not only needed to generate automatic translations:

– Accessibility for non-native speakers and hearing-impaired people

– Text searchability and analysis

– Summarisation

– Video recommendation and relation

• Reviewed transcriptions are important to generate usable draft automatic translations

Reviewing automatic transcriptions

• Once a video is ingested into the system, a draft transcription is automatically generated

• Transcribed videos are available for review using a web interface

• One more slide, then a hands-on session reviewing an automatic transcription

Evaluating transcription review process

• Review of automatic transcriptions is evaluated from two viewpoints:

– Transcription accuracy

– Time spent to review automatic transcriptions measured as Real Time Factor (RTF)

Language      Accuracy (92%)     RTF (10)
Spanish       Excellent (86%)    3
Estonian      Good (70%)         3
Portuguese    Average (57%)      5
Italian       Good (82%)         5
English       Good (81%)         6
Catalan       Good (83%)         6
Dutch         Good (75%)         6
French        Good (75%)         6
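The Real Time Factor in the table above is the time spent reviewing divided by the duration of the video. A minimal sketch of that ratio follows; the function names are hypothetical, and the default `manual_rtf` of 10 is the manual transcription rate quoted earlier in the slides.

```python
def real_time_factor(work_seconds: float, video_seconds: float) -> float:
    """RTF = time spent working on a video / duration of that video."""
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return work_seconds / video_seconds


def effort_saving(review_rtf: float, manual_rtf: float = 10.0) -> float:
    """Fraction of effort saved by reviewing a draft instead of starting from scratch."""
    return 1.0 - review_rtf / manual_rtf


# Example: a 10-minute video whose draft transcription took 30 minutes to review.
rtf = real_time_factor(work_seconds=30 * 60, video_seconds=10 * 60)
saving = effort_saving(rtf)
print(f"RTF = {rtf:.1f}, effort saved vs. manual transcription = {saving:.0%}")
```

For example, a reviewer working at an RTF of 3 spends roughly a third of the time that manual transcription at an RTF of 10 would take.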

Demo on transcription

1. Overview of the Transcription and Translation Platform (ttp.mllp.upv.es)

2. Uploading a video

3. Reviewing video transcription

4. Reviewing video translation

5. Reviewing document translation

How to improve translation accuracy

• Translation systems learn to translate from parallel texts

– Millions of sentences previously translated to learn the translation model

– Texts in millions of words to learn the language model

• Parallel texts are collected from public multilingual organisations (EU, UN, TED, etc.)

• Not all available parallel text is useful for translating your MOOC: domain adaptation is needed

Language pairs        All (Msents)   Selection (Msents)
Dutch-English                 27.3                  1.7
English-Spanish               14.0                  3.2
English-Italian               24.5                  6.4
English-French                28.8                  3.2
Estonian-English              10.5                 10.5
French-English                28.8                  0.5
Portuguese-English            27.5                  6.4
Italian-English               24.5                  6.4
Spanish-English               14.0                  6.4

• Adaptation of translation systems to the domain of the MOOC

– Text of the course to be translated

– Domain-related materials previously translated

– Bibliography of the course in the target language
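The slides do not say how the in-domain "Selection" subset is chosen from the full parallel corpus. Purely as an illustration of the idea, the sketch below uses a simple stand-in heuristic: keep a sentence pair when enough of its source words also appear in the course text. The function name, the threshold and the sample sentences are all hypothetical; real systems typically use more refined selection criteria.

```python
def select_in_domain(parallel, in_domain_text, threshold=0.5):
    """Keep sentence pairs whose source side overlaps with the in-domain vocabulary.

    parallel: iterable of (source_sentence, target_sentence) pairs.
    in_domain_text: text of the course to be translated (source language).
    threshold: minimum fraction of source words found in the in-domain vocabulary.
    """
    vocabulary = set(in_domain_text.lower().split())
    selected = []
    for source, target in parallel:
        words = source.lower().split()
        if not words:
            continue  # skip empty source sentences
        overlap = sum(word in vocabulary for word in words) / len(words)
        if overlap >= threshold:
            selected.append((source, target))
    return selected


# Hypothetical data: a general-purpose corpus and a snippet of course text.
corpus = [
    ("the neural network learns", "la red neuronal aprende"),
    ("the council adopted the regulation", "el consejo adoptó el reglamento"),
]
course_text = "neural network training learns weights"
in_domain = select_in_domain(corpus, course_text)
print(in_domain)
```

With this toy data only the first pair is kept, since three of its four source words occur in the course text while none of the second pair's do.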

Reviewing automatic translations

• Speech Recognition technology is in a more mature stage than Machine Translation

• Machine Translation has improved in recent years, but it is still far from perfect

• Quality of automatic translation depends on:

– Proximity between source and target languages

– Complexity of grammar structures used by the speaker

– How specific the vocabulary employed is

– Availability of parallel texts in the same field

• Evaluation of translation is cumbersome, since there is no single correct translation

• Translations need to be manually reviewed before publishing them

• Reviewing translations is faster than producing them from scratch

Reviewing automatic video translations

• Reviewed video transcriptions are automatically translated into the desired languages

• The same web interface allows you to review source and target subtitles in parallel

• Reviewed subtitles can be exported as SRT files
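For readers unfamiliar with SRT, each subtitle is a numbered block with a `start --> end` time line (`HH:MM:SS,mmm`) followed by the text, and blocks are separated by a blank line. The sketch below writes that layout; it is not the platform's exporter, and the `Cue` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # subtitle text (may contain line breaks)


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as SRT text: numbered blocks separated by a blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_line = f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}"
        blocks.append(f"{index}\n{time_line}\n{cue.text}\n")
    return "\n".join(blocks)


subtitles = [
    Cue(0.0, 2.5, "Welcome to the course."),
    Cue(2.5, 6.0, "Today we talk about multilingual MOOCs."),
]
print(to_srt(subtitles))
```

The same structure can hold either the source-language or the target-language subtitles, so one file per language can be written.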

Reviewing automatic document translations

• Text included in the course is ingested into the translation system

• A similar web interface allows you to review source and target texts in parallel

• Preview of source and target texts also available

• Translated text is imported back into the EMMA platform

Evaluating translation review process

• Review of translations is evaluated from two viewpoints:

– Translation accuracy automatically computed from a single reference translation

– Time spent to review automatic translations (in RTF)

Language pairs        Accuracy           RTF (30)
Spanish → English     Good (64%)         7
Spanish → Catalan     Excellent (73%)    9
English → Italian     Good (59%)         10
Dutch → English       Good (52%)         13
Italian → English     Good (53%)         14
Estonian → English    Poor (13%)         16
English → Spanish     Good (62%)         17
French → English      Average (22%)      26

Demo on translation

1. Overview of the Transcription and Translation Platform

2. Uploading a video

3. Reviewing video transcription

4. Reviewing video translation

5. Reviewing document translation

Conclusions and Discussion

• Multilingual access to your course boosts visibility

• The cost of manually translating your course is high (2 PM)

• Automatic translation can reduce the time cost to 30% - 50% of the manual effort

• Accuracy of automatic translation depends on several factors:

– Languages involved

– Availability of annotated data resources related to your course

– Specificity of the course

• Designing a multilingual MOOC should also take into account:

– Slides

– Images

– Application interfaces (demos)

– Bibliography

– In general, language-dependent content that is not easy or too costly to edit

Thank you for your attention!

Comparative results with YouTube/Google

• Comparison with YouTube in terms of Word Error Rate

Word Error Rate
Language      EMMA    YouTube
Dutch         25.7    38.6
English       39.2    70.8
Italian       28.9    31.6
Portuguese    49.8    62.3
Spanish       14.4    34.3
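Word Error Rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between the automatic transcription and a reference transcription, divided by the number of reference words, so lower is better. The sketch below shows that standard definition; it is not the scoring tool behind the table, and the function name and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * word edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (row-by-row Levenshtein).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution, one deletion
print(f"WER = {word_error_rate(reference, hypothesis):.1f}%")
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.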

• Comparison with Google Translate in terms of BLEU

Quality - BLEU
Language pairs          EMMA    Google
Dutch → English         41.6    33.4
English → Spanish       42.5    39.0
Italian → English       46.9    27.9
Portuguese → English    47.6    45.4
Spanish → English       28.2    27.6
