Slide 1
Checking Terminology Consistency with Statistical Methods
LRC XIII, 2nd October 2008
Alfredo Maldonado Guerra, Microsoft European Development Centre
Masaki Itagaki, Microsoft Corporation

Slide 2: About this presentation
Introduction
Internal Consistency Check:
  Step 1: Mine Source Terms
  Step 2: Identify Translations of Source Terms (Alignment)
  Step 3: Consistency Check
Current Challenges
Tips
Future Improvements

Slide 3: Introduction
Terminology consistency is a key element of localised language quality, but it is difficult to maintain:
  It is hard to keep source and target in sync during the development/localisation process.
  Translation is done by several people, often working remotely.
  Terminology changes (e.g. between product versions).
Manual Language Quality Assurance (QA) can help; however:
  QA costs time and money.
  QA usually concentrates on a sample of the text.
  The reviewer must be familiar with the reference material.
  It is hard for humans to keep track of terminology.

Slide 4: Introduction
Can we use technology to control consistency? Yes, but existing tools require term lists or term bases:
  Not all software companies have term bases set up.
  Companies that do have term bases won't have every single term captured; building a term base is always a work in progress.

Slide 5: Introduction
Our approach doesn't require a term base:
  Using term-mining technology, we identify terms in the source strings.
  We then check the translation consistency of the mined terminology.

Slide 6: Internal Consistency Check
1. Mine Source Terms → 2. Align Translations → 3. Consistency Check
(Diagram: the three steps applied to a bilingual string table, with an inconsistency flagged.)
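As a minimal end-to-end sketch of the three-step check, assuming toy bilingual segments and a hard-coded mined term (the deck does not publish code, so segment data, variable names, and the restriction to N = 2 and 3 are all illustrative); the alignment step scores each target NGram by the co-occurrence ratio the deck derives in Step 2 (segments containing both term and NGram, divided by the total number of target NGrams of that size):

```python
import re
from collections import defaultdict

# Toy bilingual segments (source, target); hypothetical example data.
segments = [
    ("Type text in an input field.",     "Tapez du texte dans un champ d'entrée."),
    ("The input field accepts numbers.", "Le champ d'entrée accepte les nombres."),
    ("Clear the input field first.",     "Effacez d'abord la zone de saisie."),
]

term = "input field"  # Step 1 (stubbed): assume term mining produced this term

def tokens(text):
    # Lowercase word tokens, keeping in-word apostrophes (d'entrée).
    return re.findall(r"[\w']+", text.lower())

def ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Step 2: score each target NGram T by |STSeg| / |NGrams|, where |STSeg|
# counts segments containing both the source term and T, and |NGrams|
# counts all target NGrams of T's size.
totals = defaultdict(int)   # N -> total NGrams of that size in the target column
co_seg = defaultdict(int)   # T -> number of co-occurrence segments
for src, tgt in segments:
    words = tokens(tgt)
    for n in (2, 3):
        grams = ngrams(words, n)
        totals[n] += len(grams)
        if term in src.lower():
            for t in set(grams):   # count each T once per segment
                co_seg[t] += 1
best = max(co_seg, key=lambda t: co_seg[t] / totals[len(t.split())])

# Step 3: flag segments that contain the term but not the chosen translation.
inconsistent = [tgt for src, tgt in segments
                if term in src.lower() and best not in tgt.lower()]

print(best)          # -> champ d'entrée
print(inconsistent)  # -> ["Effacez d'abord la zone de saisie."]
```

The third segment is flagged because its translator used "zone de saisie" where the corpus majority uses "champ d'entrée", which is exactly the inconsistency the pipeline is meant to surface.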
Slide 7: Step 1: Source Term Mining
Bigram and trigram extraction of noun phrases of the form:
  Noun + Noun
  Noun + Noun + Noun
Verb phrases are discriminated: 5% of terms.
Adjective phrases are discriminated: 2% of terms.
Monogram nouns are discriminated: most are common words, and only 27% of terms are monograms.
In the future we might cover Adj + Noun forms.

Slide 8: Step 2: Translation Alignment
Problem statement: given a mined source term S, identify the corresponding target term T in the translation column.
Example: mined term "input field" (S) → "champ d'entrée" (T).

Slide 9: Step 2: Translation Alignment
We need to consider all possible term combinations; we call each combination an NGram.
NGrams: N = 2, 3, 4, maybe 5. For languages like German we even consider N = 1.
How do we decide which NGram is the correct translation for the term? Bayesian statistics can help!
Example string: "Réattribue leurs valeurs initiales à tous les champs d'entrée."
Candidate NGrams include: "Réattribue leurs", "leurs valeurs", "valeurs initiales", "Réattribue leurs valeurs", "leurs valeurs initiales", ...

Slide 10: Step 2: Translation Alignment
Problem statement: given a source term S, obtain the NGram T that maximises the conditional probability:

  T = argmax_T P(T | S)   [1]

But how do we calculate this?

Slide 11: Step 2: Translation Alignment
The multiplication rule of conditional probability tells us that P(S, T) = P(T | S) P(S), so [1] becomes:

  T = argmax_T P(S, T) / P(S)   [2]

And we also know that:

  P(S, T) = |STSeg| / |NGrams|

where:
  |NGrams| is the number of NGrams with the same N as T. For example, if T is a two-word term (a bigram), |NGrams| is the number of NGrams made up of 2 words.
  |STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.

Slide 12: Step 2: Translation Alignment
In our best-target-term selection routine we compare the probabilities of different target terms T_k. Since P(S) remains constant during these comparisons, we can eliminate it.
We call the resulting equation I(T_k):

  I(T_k) = |ST_kSeg| / |NGrams|   [3]

The candidate T_k with the highest I is our best target term candidate.

Slide 13: Step 2: Translation Alignment (Normalisation)
Depending on context, any particular term can be translated in slightly different ways. For example, "file name" could be translated into Spanish as:
  nombre de archivo
  nombre del archivo
  nombres de archivo
  nombres de archivos
  nombres de los archivos
Our algorithm has to be clever enough to realise that "nombres de archivo" is just a form of "nombre de archivo".

Slide 14: Step 2: Translation Alignment (Normalisation)
During NGram generation we therefore generate regular expressions for our terms. Since Asian languages do not inflect, their regular expressions are simpler; for European languages we use more complex ones.

  Source term:  error code
  Target term (Italian):  codice errore
  Regular expression:  \bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b
  Matches (admitted translations):  codice d'errore, codice di errore, codice errore, codici di errore

  Source term:  error code
  Target term (Japanese) and regular expression:  \b ... \s? ... \b (the Japanese characters were not preserved in this transcript)
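The Italian pattern from the table can be exercised directly. A quick sketch: the pattern string is the slide's regular expression rejoined across the table's line break, and the rephrased negative example is mine, not from the deck:

```python
import re

# The slide's Italian regular expression for "error code" -> "codice errore".
pattern = re.compile(r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b")

# The admitted translations listed on Slide 14 should all match.
admitted = ["codice d'errore", "codice di errore", "codice errore", "codici di errore"]
print([bool(pattern.search(s)) for s in admitted])  # -> [True, True, True, True]

# A heavily rephrased string should not be admitted (hypothetical example).
print(bool(pattern.search("errore nel codice")))    # -> False
```

The optional middle group `(\s\w{1,4}'?){0,2}` is what absorbs the short function words ("di", "d'") between the two content words, which is how one regular expression covers both inflected and prepositional variants.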
Slide 15: Step 3: Consistency Check
Detect the strings that do not use any of our admitted translations.
Report these strings, along with our findings, to the user.

Slide 16: Current Challenges
False positives:
  due to heavy rephrasing;
  unreliable for short, generic monograms.
For example, the source term "data" yields these admitted translations in Italian: d, d3d, da, dac, dai, dal, dall, data, dati, dato, dc, ddc, dei, del, dell, deny, der, deve, dfs, dhcp, di, dir, disk, dll, dma, dns, dopo, dos, dove, dpc, dsis, dtr, due, dvd, dwm

Slide 17: Current Challenges
Verbs can potentially cause problems:
  High inflection: amar → amo, amas, ama, amamos, amáis/aman, aman; venir → vengo, vienes, viene, venimos, venís/vienen, vienen.
  They are difficult to differentiate from other parts of speech.
Not all languages are supported: Arabic and complex-script languages.

  Source term  Admitted translations                                                        Target language
  download     descarga, descargar, descargó, descargue                                     Spanish
  install      install, installa, installare, installata, installati, installato, installer Italian

Slide 18: Current Challenges
The best-candidate selection logic is very good, but it's not perfect: about 70% of term selections are correct.
(Screenshot: correct and incorrect selections, with the correct term highlighted in the incorrect cases.)

Slide 19: Tips
Make sure your data is reasonably clean:
  Remove any HTML/XML tags from your strings.
  Filter out any unlocalised strings and non-localisable strings.
  For Asian languages, run a word-breaker tool on your target strings (this is required for proper NGram handling).

Slide 20: Tips
If you already have source term lists you're interested in, you can use them to bypass the term-mining process. If your source terms are well selected, you'll achieve very good results. A well-selected source term has a precise technical meaning:

  Source term           Good/Bad  Reason
  failure               bad       Too generic
  data                  bad       Too generic; forms part of many other terms: data type, data structure, etc.
  worker process        good      Has a precise meaning
  user account control  good      Has a precise meaning

Slide 21: Tips
The more data you have, the more accurate your results will be. Try combining software data with help / user-education data to increase term repetitions.

Slide 22: Future Improvements
More work with Adj + Noun forms.
Work with verbs.
Add support for complex-script languages and for languages that inflect on different parts of the word.
Further refine the best-translation-candidate selection logic.

Slide 23: Questions?

Slide 24: Thank You!