TAUS MT SHOWCASE, MT & Terminology Better Together, Andrejs Vasiljevs, 10 October 2013

Andrejs Vasiļjevs

chairman of the board

andrejs@tilde.com

Localization World, Santa Clara

October 9, 2013

MT & Terminology:

better together

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 130 employees

• Strong R&D team

• 5 PhDs, 80+ research papers

• Trusted partner of the EU for significant R&D projects

challenge

platform

challenge

[ttable-file]

0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat

steps/1/LM_toy_tokenize.1

steps/1/LM_toy_tokenize.1.DONE

steps/1/LM_toy_tokenize.1.INFO

steps/1/LM_toy_tokenize.1.STDERR

steps/1/LM_toy_tokenize.1.STDERR.digest

steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \

--corpus factored-corpus/proj-syndicate \

--root-dir unfactored \

--f de --e en \

--lm 0:3:factored-corpus/surface.lm:0
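The `--lm` argument above packs several settings into one colon-separated string. A minimal sketch of how such a value could be unpacked is shown here. It assumes the four fields are, in order, the factor index, the n-gram order, the language model file, and a numeric code for the language model implementation. The names `LMSpec` and `parse_lm_spec` are hypothetical and are not part of Moses.

```python
from typing import NamedTuple


class LMSpec(NamedTuple):
    """Illustrative container for one --lm value (field names are assumptions)."""
    factor: int    # which factor the language model is built over
    order: int     # n-gram order
    path: str      # language model file
    lm_type: int   # numeric code for the LM implementation


def parse_lm_spec(spec: str) -> LMSpec:
    """Split a 'factor:order:file:type' string into its four fields."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}: {spec!r}")
    factor, order, path, lm_type = parts
    return LMSpec(int(factor), int(order), path, int(lm_type))


spec = parse_lm_spec("0:3:factored-corpus/surface.lm:0")
print(spec.order, spec.path)
```

Read this way, the command trains against a 3-gram model over factor 0 stored in `factored-corpus/surface.lm`.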

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true

alignment-symmetrization-method = berkeley

berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh

berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh

berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar

berkeley-java-options = "-server -mx30000m -ea"

berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"

berkeley-process-options = "-EMWordAligner.numThreads 8"

berkeley-posterior = 0.5

tokenize

in: raw-stem

out: tokenized-stem

default-name: corpus/tok

pass-unless: input-tokenizer output-tokenizer

template-if: input-tokenizer IN.$input-extension OUT.$input-extension

template-if: output-tokenizer IN.$output-extension OUT.$output-extension

parallelizable: yes

working-dir = /home/pkoehn/experiment

wmt10-data = $working-dir/data
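The two settings above show one value referring to another through a `$name` reference. A minimal sketch of that substitution idea follows. It assumes plain `key = value` lines and that a reference always points to a key defined earlier. The functions `expand` and `parse_settings` are hypothetical and do not reproduce the actual experiment management scripts, which also handle sections and other syntax.

```python
import re

# A reference is '$' followed by letters, digits, and hyphens (assumed syntax).
_REFERENCE = re.compile(r"\$([A-Za-z0-9-]+)")


def expand(value: str, settings: dict) -> str:
    """Replace each $name in value with its setting; unknown names stay as-is."""
    def substitute(match: re.Match) -> str:
        return settings.get(match.group(1), match.group(0))
    return _REFERENCE.sub(substitute, value)


def parse_settings(text: str) -> dict:
    """Read 'key = value' lines in order, expanding references as they appear."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = expand(value, settings)
    return settings


config = """
working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
"""
print(parse_settings(config)["wmt10-data"])
```

With these two lines, `wmt10-data` resolves to `/home/pkoehn/experiment/data`.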

customization

challenge

do-it-yourself

MT factory

on the cloud

• Automated training of SMT systems from specified collections of data

• Repository of parallel and monolingual corpora

• Based on the open-source MT tools GIZA++ and Moses

• Services for data collection, MT generation, customization, and running of a variety of user-tailored MT systems

Platform Architecture

[Diagram. Labels: training data provided; sharing of training data; training; using; Giza++; Moses SMT toolkit; Moses decoder; SMT Resource Repository; SMT Multi-Model Repository (trained SMT models); SMT Resource Directory; SMT System Directory; system management, user authentication, access rights control ...; web page; web service; web page translation widget; CAT tools; web browser plug-ins]

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

integration

• Training data on the LetsMT! platform

• 119 languages

• 2.1 B parallel units in total

• 253 language pairs

• 860 corpora

• 249 production MT systems currently on the platform

General Domain MT

English – Lithuanian

5.3 M parallel sentences

81 M monolingual sentences

QUALITY

LetsMT – 26.65 BLEU

Google – 25.85 BLEU

Beating

Google Translate

• MT service for

e-Government

• Mobile Translation

• Desktop Translation

Productivity

►Average translation productivity:

►Baseline with TM only: 550 w/h

►With TM and MT: 731 w/h

32.9% productivity increase

►High variability in individual performance

►Increase of error score from 20.2 to 28.6 points but still at the level “GOOD” (<30 points)
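The reported productivity gain follows directly from the two throughput figures above. A short worked check is sketched here; the helper name `percent_increase` is illustrative only.

```python
def percent_increase(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100.0


# Words per hour: translation memory only vs. translation memory plus MT.
tm_only = 550
tm_and_mt = 731

gain = percent_increase(tm_only, tm_and_mt)
print(f"{gain:.1f}% productivity increase")

# The error score rose from 20.2 to 28.6 points; "GOOD" means below 30 points.
error_with_mt = 28.6
print("still GOOD" if error_with_mt < 30 else "below GOOD")
```

The difference of 181 words per hour over a baseline of 550 gives the 32.9% figure on the slide.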

Czech Polish

Latvian

How to instruct

SMT to use the

right terms?

terminology

service

cloud-based

platform for

acquiring, cleaning,

sharing, and reusing

multilingual

terminological data

TaaS Services

[Diagram. Labels: term identification and annotation (identifying and marking terms); TaaS Terminology Services; Terminology Annotation Web Service API; Showcase Web Page; inputs: plaintext, ITS 2.0 enriched content; outputs: term-annotated content, ITS 2.0 term-annotated content; export / visualisation; machine users: CAT tools, MT systems; human users (e.g., translators, terminologists)]

• New W3C standard: Internationalization Tag Set (ITS) 2.0

HTML Term Annotation

Term entries for terms identified in EuroTermBank are stored in TBX format in a <script> element that is placed in the HTML5 document.
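As a rough illustration of what term-annotated HTML5 content can look like, the sketch below wraps known terms in `<span>` elements that carry the ITS 2.0 attributes `its-term` and `its-term-info-ref`. The glossary, the `#term-1` reference, and the function `annotate_terms` are hypothetical. Pointing the reference at an entry inside the embedded TBX block is an assumption, not something the slides confirm, and this is not the TaaS implementation.

```python
import html
import re


def annotate_terms(text: str, term_refs: dict) -> str:
    """Wrap each known term in a span with ITS 2.0 terminology attributes.

    term_refs maps lowercase terms to a reference for the term entry
    (here assumed to be a fragment identifier inside the same document).
    """
    if not term_refs:
        return html.escape(text)
    # Try longer terms first so multi-word terms win over their parts.
    alternatives = sorted(term_refs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in alternatives) + r")\b",
        re.IGNORECASE,
    )
    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[position:match.start()]))
        reference = html.escape(term_refs[match.group(0).lower()], quote=True)
        pieces.append(
            f'<span its-term="yes" its-term-info-ref="{reference}">'
            f"{html.escape(match.group(0))}</span>"
        )
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)


glossary = {"brake pad": "#term-1"}  # hypothetical term and reference
print(annotate_terms("Replace the brake pad.", glossary))
```

A consumer such as a CAT tool or an MT system could then look up the referenced entry to find the approved target-language term.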

XLIFF Term Annotation

Narrow Domain Automotive MT

English – Latvian

2 M unique parallel sentences

1.9 M monolingual sentences

0.2 M in-domain monolingual

QUALITY

16% improvement from

terminology integration

Beating

Google Translate

synergy of machine translation and terminology services on the cloud

tilde.com

The research within the projects LetsMT! and TaaS has received funding from the European Commission ICT Policy Support Programme (ICT PSP) and FP7 Programme

thank you
