Multilingual Data Value Chain for CEF Automated Translation: Interoperability Plan CEF.AT workshop Luxembourg, 22 Sept 2015 Dave Lewis, ADAPT Centre [email protected]



Sustainable ML Data Value Chain

[Diagram labels: Translators; CEF.AT; Improved Productivity; Low Cost MT; Domain Adapted; Approved Terminology; Language Pairs; Language Resources; TM; Term base; Domain knowledge; Quality Assurance]

•  Language Resources: Discover, Check Rights, Select & Use
•  Translation Productivity: Postedit, Consume, Quality Assure
•  LR Enrichment: Enrichment Services, Validate Annotation, Self-build Micro-domains

Value Chain Interoperability

•  Move from Archival Curation to Active Curation
•  Open meta-data published at source:
   –  W3C Data Catalog Vocabulary (DCAT)
   –  Legacy meta-data conversion & validation
   –  Concretely: Meta-Share Linked Data Mapping
      •  http://www.w3.org/community/ld4lt/wiki/Meta-Share_OWL_metamodel
•  Searchable Cataloguing Service:
   –  Concretely: LingHub
      •  http://linghub.lider-project.eu/
•  Machine readable rights/license
   –  W3C Open Digital Rights Language
      •  https://www.w3.org/community/odrl/
   –  Use for Translation IP

LR: Discovery & Usage Rights
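As a rough illustration of "open meta-data published at source" with a machine-readable licence, the sketch below builds a JSON-LD description of a parallel text dataset using DCAT terms and attaches an ODRL policy. The `example.org` URLs, the title, the language pair and the chosen permissions are placeholders invented for illustration; the helper name `describe_parallel_corpus` is hypothetical, and a real profile would be whatever the community agrees on.

```python
import json

def describe_parallel_corpus(dataset_uri, title, languages, download_url):
    """Return a JSON-LD dict describing a parallel text dataset.

    Uses DCAT for cataloguing and ODRL for usage rights. The choice of
    permissions below is an assumption made for illustration only.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "odrl": "http://www.w3.org/ns/odrl/2/",
        },
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:language": languages,
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
        },
        # Machine-readable rights: who may do what with this dataset.
        "odrl:hasPolicy": {
            "@type": "odrl:Set",
            "odrl:permission": [
                {"odrl:target": {"@id": dataset_uri},
                 "odrl:action": {"@id": "odrl:reproduce"}},
                {"odrl:target": {"@id": dataset_uri},
                 "odrl:action": {"@id": "odrl:derive"}},
            ],
        },
    }

# Placeholder values, not a real dataset.
record = describe_parallel_corpus(
    dataset_uri="http://example.org/lr/demo-tm",
    title="Demo public sector translation memory",
    languages=["en", "fr"],
    download_url="http://example.org/lr/demo-tm.tmx",
)
print(json.dumps(record, indent=2))
```

A catalogue service could harvest records of this shape and filter on the permitted actions before a resource is selected for MT training.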

•  Linked data from existing format:
   –  TMX, XLIFF to W3C CSV-on-the-Web to RDF
•  Selection meta-data
   –  Provenance (MT or PE) & translation language codes
   –  Dereferenceable segments for open annotation of terms
•  MT Web Service APIs
   –  Forced decoding with term translations
   –  Iterative Re-training API
   –  MT log data: out of vocabulary & forced term to inform PE productivity

LR Select & Use
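The first step of the TMX to CSV to RDF path can be sketched with the Python standard library alone. This is only an illustration under assumptions: the `x-provenance` property type, the `#tu-` fragment pattern for segment identifiers, the column names and the sample TMX content are all invented here and are not taken from any agreed profile.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny made-up TMX sample; "x-provenance" is a hypothetical property type.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="1">
      <prop type="x-provenance">PE</prop>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Ouvrez le fichier.</seg></tuv>
    </tu>
    <tu tuid="2">
      <prop type="x-provenance">MT</prop>
      <tuv xml:lang="en"><seg>Save your changes.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez vos modifications.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def tmx_to_csv(tmx_text, src_lang, tgt_lang, base_uri):
    """Flatten one language pair of a TMX document into CSV text.

    Each row carries a segment URI, the language codes, both texts and
    the provenance flag (MT or PE) when the unit declares one.
    """
    root = ET.fromstring(tmx_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["segment", "source_lang", "target_lang",
                     "source", "target", "provenance"])
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup.
                texts[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang not in texts or tgt_lang not in texts:
            continue  # unit does not cover the requested pair
        provenance = ""
        for prop in tu.findall("prop"):
            if prop.get("type") == "x-provenance":
                provenance = prop.text or ""
        writer.writerow([f"{base_uri}#tu-{tu.get('tuid')}", src_lang,
                         tgt_lang, texts[src_lang], texts[tgt_lang],
                         provenance])
    return out.getvalue()

csv_text = tmx_to_csv(SAMPLE_TMX, "en", "fr", "http://example.org/tm/demo")
print(csv_text)
```

Giving every row its own URI is what makes a segment addressable, so that later term annotations can point at it.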

•  Bottom line: did MT make translation more productive?

•  Measure #1: Post-editing effort
   –  A/B test on total segment post-editing time
   –  Open Edit Vector format
   –  iOmegaT: instrumented open source CAT tool
   –  Edit vector analysis tool (licensable)
•  Measure #2: ML Web Site analysis
   –  A/B test on translated web pages (MT vs PE vs HT)
   –  Easyling web translation proxy

Translation Productivity
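One simple way to analyse an A/B test on per-segment editing time is a permutation test on the difference of means. The sketch below is an assumption about how such an analysis could be done, not the method used by the edit vector analysis tool; the timing values are synthetic placeholders, not measurements.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on mean(a) - mean(b).

    Returns the observed difference and an estimated p-value: the share
    of random relabellings whose difference is at least as extreme.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Synthetic seconds-per-segment values, for illustration only.
without_mt = [42.0, 55.5, 38.2, 61.0, 47.3, 50.1]
with_mt = [30.4, 41.2, 29.9, 44.8, 35.0, 39.6]

saved_seconds, p_value = permutation_test(without_mt, with_mt)
print(f"mean time saved per segment: {saved_seconds:.1f}s, p = {p_value:.3f}")
```

A permutation test makes no assumption about how editing times are distributed, which matters because such timings are typically skewed by pauses and interruptions.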

•  Enrich segments with links to open lexical-conceptual resources
   –  Word Sense Disambiguation, Entity Linking, Automated Term Extraction
      •  Babelfy API
      •  DBpedia Spotlight, TaaS APIs
•  Open validation
   –  Publish positive/negative validation of enrichment from translation projects
   –  In-context validation from project posteditors and terminologists using TBX status flags

LR Enrichment
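Publishing positive and negative validations implies some record of who judged which link and how. A minimal sketch of aggregating such judgements follows; the record layout, the `+`/`-` verdict encoding and the function name `tally_validations` are assumptions for illustration, and the sample records are invented.

```python
from collections import Counter, defaultdict

def tally_validations(records):
    """Aggregate validation verdicts per (segment, link) pair.

    records: iterable of (segment_id, link, verdict), verdict '+' or '-'.
    Returns a dict mapping (segment_id, link) to positives minus negatives.
    """
    counts = defaultdict(Counter)
    for segment_id, link, verdict in records:
        if verdict not in ("+", "-"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        counts[(segment_id, link)][verdict] += 1
    return {key: c["+"] - c["-"] for key, c in counts.items()}

# Invented judgements from posteditors and terminologists.
records = [
    ("tu-1", "http://dbpedia.org/resource/Luxembourg", "+"),
    ("tu-1", "http://dbpedia.org/resource/Luxembourg", "+"),
    ("tu-2", "http://dbpedia.org/resource/Luxembourg", "-"),
]
scores = tally_validations(records)
print(scores)
```

Keeping the segment in the key preserves the in-context nature of the judgement: the same link can be right in one segment and wrong in another.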

•  Goal: Reduce cost of collecting and selecting parallel data
•  Agree & Promote DCAT Profile for publishing public sector parallel text
•  Establish suite of common machine-readable licences (ODRL)
•  DCAT and licence meta-data profile for standardised parallel text format
   –  XLIFF 2.0 module
   –  TMX update – new OASIS TC
   –  CSV on the Web
•  Linghub as basis for public index/search service
•  Minimise distance between published parallel text and meta-data passed along translation value chain

Interoperability Plan: Parallel Text

•  Goal: Make it easy for public bodies to measure impact of MT on their translation processes

•  Agree/Promote Open Edit Vector format
   –  Encourage integration in CAT tools

•  Guidelines on A/B testing, analysis and interpretation

•  Open feedback channels to CEF.AT

Interoperability Plan: Productivity

•  Goal: annotate segments with links to terms and lexical-conceptual resources

•  Agree/promote Open Annotation links
   –  XLIFF 2.1: inline ITS Terminology and TextAnalysis attributes or standoff with XLIFF fragment
   –  Need similar ITS profile and fragment for TMX
   –  Profile W3C CSV-on-the-Web with Open Annotation
•  Guidelines on dereferencing links to Term-bases or Lexical-Conceptual resources
   –  W3C Ontolex group
•  Validation workflow and feedback
   –  Trials with FREME, Babelfy, others

Interoperability Plan: LR Enrichment
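To make the inline option concrete, the sketch below wraps a stretch of segment text in an element carrying ITS 2.0 Text Analysis attributes (`its:taIdentRef`, `its:taClassRef`). It uses generic `seg` and `span` element names chosen for illustration, not the XLIFF 2.1 ITS module serialisation, and the sentence and entity link are examples only.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)

def annotate(text, start, end, ident_ref, class_ref=None):
    """Return XML for `text` with text[start:end] linked to a resource.

    Element names are placeholders; only the ITS attributes are standard.
    """
    seg = ET.Element("seg")
    seg.text = text[:start]
    span = ET.SubElement(seg, "span")
    span.text = text[start:end]
    span.tail = text[end:]
    span.set(f"{{{ITS_NS}}}taIdentRef", ident_ref)
    if class_ref is not None:
        span.set(f"{{{ITS_NS}}}taClassRef", class_ref)
    return ET.tostring(seg, encoding="unicode")

sentence = "The workshop was held in Luxembourg."
start = sentence.index("Luxembourg")
annotated = annotate(
    sentence, start, start + len("Luxembourg"),
    ident_ref="http://dbpedia.org/resource/Luxembourg",
    class_ref="http://dbpedia.org/ontology/Country",
)
print(annotated)
```

A standoff variant would keep the segment untouched and store the same link separately, pointing at the segment through a fragment identifier.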

THANK YOU! [email protected]