

Creating Translation Context with Disambiguation

Tadej Štajner – Jožef Stefan Institute
Yves Savourel – ENLASO Corporation

Localization World – London – June 2013

Context: A Shortcoming

• Traditionally, translation tools have been strong on code handling and re-use of existing translations.

• But they have been less good at providing context or linguistic resources for the translators.

• Things are improving and are bound to improve even more.

New Factors

• Component-based processing is becoming widespread (i.e. the source text goes through several preparation steps: TM, MT, etc.)

• Web services allow a single process to tap into many different resources; specialization becomes easier.

• Now ITS 2.0 provides a common way to carry various information across tools/services.

ITS: Internationalization Tag Set

• A set of common internationalization and localization-related features (called “data categories”) for XML, and now with ITS 2.0 also for HTML5

• ITS 2.0 is being finalized at the W3C: http://www.w3.org/TR/its20/

ITS and “Context”

• ITS 2.0 offers several data categories that can help with contextual information: Localization Note, Terminology, Id Value, Domain and Text Analysis.

• Quick glance at the first four, then in-depth look at Text Analysis.

Localization Note

Comments put in the source document and meant to be seen by the translators.

<msg its:locNote="%s is for On or Off">Click the %s button</msg>
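As a rough illustration of how a tool might surface such notes to translators, here is a minimal Python sketch that collects local `its:locNote` attributes from an XML document. The `<doc>` wrapper, the namespace declaration on it, and the function name `collect_loc_notes` are assumptions added for the example; only the `<msg>` element comes from the slide. It ignores global rules and `locNoteRef`.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"  # ITS namespace URI

# The <doc> wrapper and the xmlns:its declaration are added for illustration;
# the slide shows only the <msg> element.
SAMPLE = (
    f'<doc xmlns:its="{ITS_NS}">'
    '<msg its:locNote="%s is for On or Off">Click the %s button</msg>'
    '</doc>'
)

def collect_loc_notes(xml_text):
    """Return (source text, note) pairs for elements with a local its:locNote."""
    root = ET.fromstring(xml_text)
    attr = f"{{{ITS_NS}}}locNote"  # ElementTree's {namespace}name form
    return [
        (el.text or "", el.get(attr))
        for el in root.iter()
        if el.get(attr) is not None
    ]

notes = collect_loc_notes(SAMPLE)
for source, note in notes:
    print(f"{source!r} -> note for translator: {note!r}")
```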

Terminology

Annotates a “term” in the content and, optionally, provides additional related information.

<p>We need a new <span its-term="yes" its-term-info-ref="http://en.wikipedia.org/wiki/Motherboard">motherboard</span>.</p>

Id Value

Provides a way to associate unique IDs with parts of the content during translation. Can be useful for software text, where IDs are often descriptive.

<its:idValueRule selector="//msg" idValue="@name"/>
...
<msg name="FILENOTFOUND">Not found</msg>
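To make the effect of the rule concrete, the following Python sketch applies a very small subset of it: a selector of the form `//tag` and an `idValue` of the form `@attr`. Real ITS selectors and `idValue` expressions are full XPath, which this does not attempt to cover. The `<res>` wrapper and the function name `apply_id_value_rule` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The <res> wrapper is added for illustration; the <msg> element is from the slide.
SAMPLE = '<res><msg name="FILENOTFOUND">Not found</msg></res>'

def apply_id_value_rule(root, selector, id_value):
    """Map ID -> text for a '//tag' selector and an '@attr' idValue.

    This handles only the simple shape used on the slide, not general XPath.
    """
    if not (selector.startswith("//") and id_value.startswith("@")):
        raise ValueError("sketch supports only '//tag' and '@attr'")
    tag, attr = selector[2:], id_value[1:]
    return {
        el.get(attr): (el.text or "")
        for el in root.iter(tag)
        if el.get(attr) is not None
    }

ids = apply_id_value_rule(ET.fromstring(SAMPLE), "//msg", "@name")
print(ids)
```

A translation tool could then show the ID next to the segment, so that "Not found" is read in the light of FILENOTFOUND.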

Domain

Identifies the general topic area of the content to translate. Can be useful for selecting MT engines.

<its:domainRule selector="/h:html" domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"/>
...
<meta name="keywords" content="automotive"/>
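A minimal Python sketch of the idea behind this rule: read the `keywords` meta element of an HTML page and use the value to pick an MT engine. The engine names, the `ENGINES` table, and the fallback are purely hypothetical; the sketch hard-codes the pointer instead of evaluating the XPath in `domainPointer`, and it assumes multiple keywords are comma-separated.

```python
from html.parser import HTMLParser

class KeywordsDomainParser(HTMLParser):
    """Collect domain values from <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.domains = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "keywords":
            content = a.get("content") or ""
            # Assumption: several keywords are separated by commas.
            self.domains.extend(k.strip() for k in content.split(",") if k.strip())

# Hypothetical mapping from domain to MT engine identifier.
ENGINES = {"automotive": "mt-automotive"}

def pick_engine(html_text, default="mt-general"):
    parser = KeywordsDomainParser()
    parser.feed(html_text)
    for domain in parser.domains:
        if domain in ENGINES:
            return ENGINES[domain]
    return default

PAGE = ('<!DOCTYPE html><html><head><meta name="keywords" content="automotive"/>'
        '<title>Demo</title></head><body><p>Text</p></body></html>')
engine = pick_engine(PAGE)
print(engine)
```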

Text Analysis

• Annotates content with lexical or conceptual information.

• Useful for many things:
– Term suggestion
– General context information
– Suggestion of things not to translate
– Automated transliteration of proper names
– Etc.

Text Analysis: An Example

Enrycher is an example of a component that generates Text Analysis annotations and can be easily integrated with translation tools or localization processes.

Motivation

• Translating proper names can be problematic for statistical MT systems

Motivation (2)

• There are specific rules to translate (or transliterate) proper names

• Solution: figure out what is actually being mentioned and see if any existing translated expression exists for that entity

Motivation (3)

• Examples: personal names, product names, geographic names, chemical compounds, protein names

• Names and phrases appear in situations without sufficient context (UI labels, etc.)

ITS 2.0 Text Analysis

• Supports text analysis agents that enhance content by suggesting or identifying concepts, identified by IRIs.

• A Text Analysis annotation marks a text fragment with:
– entity type
– entity identifier
– confidence

Text Analysis in ITS 2.0 – what can it tell us?

• Does a text fragment represent some entity?
– London is lovely in the summer.
– Out of 73 known entities named London, we mean a particular one: http://dbpedia.org/resource/London

• … a particular type of entity?
– London is a phrase, representing a location

• … and with what confidence?

ITS 2.0 Text Analysis

<!DOCTYPE html>
<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">
  <span its-ta-ident-ref="http://dbpedia.org/resource/London"
        its-ta-class-ref="http://schema.org/Place">London</span>
  is the
  <span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span>
  of
  <span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom"
        its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.
</div>
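The annotated markup above can be read back by a consuming tool. The following Python sketch, using only the standard library, extracts each annotated `<span>` with its identifier, class, and confidence (`its-ta-confidence` is absent from the slide's sample, so it comes back as `None`). The class name `TextAnalysisParser` and the record layout are assumptions made for the example.

```python
from html.parser import HTMLParser

class TextAnalysisParser(HTMLParser):
    """Collect spans carrying ITS 2.0 Text Analysis attributes."""

    def __init__(self):
        super().__init__()
        self.annotations = []  # finished records, in closing order
        self._stack = []       # one entry per open <span>: a record or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        if "its-ta-ident-ref" in a or "its-ta-class-ref" in a:
            self._stack.append({
                "text": "",
                "ident": a.get("its-ta-ident-ref"),
                "class": a.get("its-ta-class-ref"),
                "confidence": a.get("its-ta-confidence"),
            })
        else:
            self._stack.append(None)  # plain span, tracked only for nesting

    def handle_data(self, data):
        # Text belongs to every annotated span that is currently open.
        for rec in self._stack:
            if rec is not None:
                rec["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            rec = self._stack.pop()
            if rec is not None:
                self.annotations.append(rec)

SAMPLE = (
    '<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">'
    '<span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span> is the '
    '<span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">'
    'capital</span> of '
    '<span its-ta-ident-ref="http://dbpedia.org/resource/United_Kingdom" '
    'its-ta-class-ref="http://schema.org/Place">United Kingdom</span>.</div>'
)

parser = TextAnalysisParser()
parser.feed(SAMPLE)
for rec in parser.annotations:
    print(rec["text"], "|", rec["class"], "|", rec["ident"])
```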

Producing these annotations

• Manual annotation

• Automated NLP Techniques
– Named entity extraction & disambiguation
– Word sense disambiguation

Use cases

• Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules:
– proper names
– officially regulated translations

• Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing

Named entity disambiguation

[Figure: diagram with the labels Document, Label, Entity, and Mention]

Named entity disambiguation – behind the scenes

• A difficult problem:
– A name can refer to many entities, an entity can have many names
– Which interpretation is correct?

• Humans are pretty good at this
1. We have prior knowledge on the ‘usual’ meanings
2. We can glean the meaning from the context
3. Things that are related appear together

Named entity disambiguation – behind the scenes (2)

1. Prior knowledge: what is the most frequent meaning of ‘London’?

2. Context: someone using the word ‘London’ in the context of ‘Canada’ is likely to be referring to a different London, the one in Ontario
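The two signals above can be combined in a simple scoring function. The Python sketch below is a toy illustration only, not Enrycher's actual method: the candidate list, the prior values, and the context vocabularies are all invented for the example, and the score is just the prior plus a weighted count of overlapping context words.

```python
# Toy candidate table. Priors and context words are invented for illustration.
CANDIDATES = {
    "London, England": {
        "prior": 0.9,
        "context": {"uk", "england", "thames", "britain"},
    },
    "London, Ontario": {
        "prior": 0.1,
        "context": {"canada", "ontario"},
    },
}

def disambiguate(context_words, candidates, context_weight=1.0):
    """Pick the candidate with the highest prior + context-overlap score."""
    words = {w.lower() for w in context_words}

    def score(item):
        _, info = item
        overlap = len(words & info["context"])
        return info["prior"] + context_weight * overlap

    return max(candidates.items(), key=score)[0]

# No helpful context: the prior decides.
no_context = disambiguate("London is lovely in the summer".split(), CANDIDATES)
# 'Canada' in the context outweighs the prior.
with_context = disambiguate("She moved to London in Canada".split(), CANDIDATES)
print(no_context, "|", with_context)
```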

Named entity disambiguation – behind the scenes (3)

3. Relational similarity: things connected in the knowledge graph tend to appear together
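Relational similarity can be sketched as choosing, for all mentions in a document at once, the combination of entities that is most densely connected in the knowledge graph. The Python example below is a toy under stated assumptions: the graph, the entity names, and the function `best_joint_assignment` are hypothetical, and the exhaustive search is only workable for a handful of mentions.

```python
from itertools import combinations, product

# Hypothetical, hand-made knowledge graph (undirected links).
GRAPH = {
    "London_(England)": {"United_Kingdom", "River_Thames"},
    "London_(Ontario)": {"Canada", "Ontario"},
    "United_Kingdom": {"London_(England)"},
    "River_Thames": {"London_(England)"},
    "Canada": {"London_(Ontario)", "Ontario"},
    "Ontario": {"London_(Ontario)", "Canada"},
}

def linked(a, b):
    """True if the two entities are directly connected in the toy graph."""
    return b in GRAPH.get(a, set()) or a in GRAPH.get(b, set())

def best_joint_assignment(mention_candidates):
    """Try every combination of candidates and keep the most connected one."""
    mentions = list(mention_candidates)
    best, best_score = None, -1
    for combo in product(*(mention_candidates[m] for m in mentions)):
        score = sum(linked(a, b) for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = dict(zip(mentions, combo)), score
    return best

result = best_joint_assignment({
    "London": ["London_(England)", "London_(Ontario)"],
    "Ontario": ["Ontario"],
})
print(result)
```

Because 'Ontario' is linked to the Ontario city but not to the English one, the joint assignment resolves 'London' to the Canadian entity.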

Building blocks of Enrycher

• Token-level analysis
– Sentence splitting
– Tokenization
– Lemmatization
– Part-of-speech tagging

• Entity-level analysis
– Named entity extraction
– Co-reference resolution
– Anaphora resolution
– Named entity disambiguation

• Document-level analysis
– Sentiment analysis
– Topic classification
– Keyword extraction

(not used here)

Using Enrycher

• An HTTP service endpoint: send HTML5 in, get enriched HTML5+ITS 2.0 out

• Multilingual: supports English and Slovene

• See http://enrycher.ijs.si/mlw/, or try it from the command line:

$ curl -d "<p>Welcome to London</p>" http://enrycher.ijs.si/mlw/en/entityIdent.html5its2
<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" its-ta-class-ref="http://schema.org/Place">London</span></p>
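The same call can be made from a script. Below is a minimal Python sketch using only the standard library; it mirrors the `curl -d` invocation (a POST with the HTML fragment as the body and curl's default form content type). The function names, the timeout, and the UTF-8 decoding of the response are assumptions, and the endpoint may no longer be reachable, so the example only builds the request and leaves the network call to the reader.

```python
import urllib.request

# Endpoint URL as given on the slide.
ENDPOINT = "http://enrycher.ijs.si/mlw/en/entityIdent.html5its2"

def build_request(html_fragment, url=ENDPOINT):
    """Build a POST request equivalent to: curl -d "<html fragment>" <url>."""
    return urllib.request.Request(
        url,
        data=html_fragment.encode("utf-8"),
        # curl -d sends this content type by default.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

def annotate(html_fragment, timeout=10):
    """Send the fragment and return the annotated HTML (assumes UTF-8 output)."""
    with urllib.request.urlopen(build_request(html_fragment), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

req = build_request("<p>Welcome to London</p>")
print(req.get_method(), req.full_url)
# To actually call the service: print(annotate("<p>Welcome to London</p>"))
```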

Enrycher Integrated in Okapi

• The Okapi Framework is an open-source, cross-platform set of components designed to help build localization processes.

• One of its components is a client of the Enrycher services.

• Text Analysis annotations can be applied to any document in a format supported by the Okapi filters.

[Figure: pipeline diagram with the labels Input File, Extraction Step, Enrycher Step, Enrycher Server, Term Extraction Step, Terms, Other Steps…, Trans-Kit Creation Step, Translation Kit, XLIFF. Caption: One example of usage of the Enrycher Web services]

Enrycher Step

• Converts batches of segments (in Okapi’s internal format) into HTML paragraphs and sends them to the Enrycher service.

• Converts the annotated paragraphs back into Okapi’s internal format.

• Next steps can use the Text Analysis metadata, e.g. XLIFF output, OmegaT comments, etc.
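The round trip described above can be sketched as follows. This is a hypothetical Python illustration, not the Okapi step itself (Okapi is a Java framework with its own segment and inline-code model): plain strings stand in for segments, each segment becomes one `<p>` element, and the annotated response is split back into one entry per paragraph with its inline markup kept.

```python
import html
import re

def segments_to_html(segments):
    """Wrap each plain-text segment in a <p> element, escaping &, < and >."""
    return "\n".join(f"<p>{html.escape(s, quote=False)}</p>" for s in segments)

def html_to_segments(annotated):
    """Split the annotated HTML back into one string per paragraph.

    Inline markup such as <span its-ta-...> is kept inside each entry.
    """
    return re.findall(r"<p>(.*?)</p>", annotated, flags=re.DOTALL)

batch = ["Welcome to London", "Terms & conditions"]
payload = segments_to_html(batch)

# Stand-in for the service response: the first paragraph comes back annotated.
response = payload.replace(
    "London",
    '<span its-ta-class-ref="http://schema.org/Place">London</span>',
)
annotated_segments = html_to_segments(response)
print(annotated_segments)
```

Keeping one paragraph per segment means the annotated output can be matched back to the original segments by position.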

Term Extraction Step

• The Term Extraction Step offers various simple ways to guess terms in source content.

• One of its methods is to re-use the content annotated with the Text Analysis metadata to feed the list of term candidates.
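A minimal sketch of that second method, assuming the annotated spans have already been extracted as (text, identifier) pairs: normalize the text, count occurrences, and emit a candidate list. The function name `term_candidates`, the lowercasing, and the `min_count` threshold are assumptions for illustration, not the behavior of the Okapi step.

```python
from collections import Counter

def term_candidates(annotations, min_count=1):
    """Build (term, count, identifier) candidates from annotated fragments."""
    counts = Counter()
    idents = {}
    for text, ident in annotations:
        key = " ".join(text.split()).lower()  # collapse whitespace, lowercase
        if not key:
            continue
        counts[key] += 1
        idents.setdefault(key, ident)  # keep the first identifier seen
    return [
        (term, n, idents[term])
        for term, n in counts.most_common()
        if n >= min_count
    ]

# Sample input in the shape a Text Analysis reader might produce.
annotated = [
    ("London", "http://dbpedia.org/resource/London"),
    ("United Kingdom", "http://dbpedia.org/resource/United_Kingdom"),
    ("london", "http://dbpedia.org/resource/London"),
]
candidates = term_candidates(annotated)
for term, count, ident in candidates:
    print(term, count, ident)
```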

Demonstration…

Questions?

• Enrycher: http://enrycher.ijs.si/

• Okapi Framework: http://okapi.opentag.com/

• ITS 2.0: http://www.w3.org/TR/its20/