Transcript
Page 1: Checking Terminology Consistency with Statistical Methods LRC XIII 2 nd October 2008 Alfredo Maldonado Guerra Microsoft European Development Centre Masaki

Checking Terminology Consistency with Statistical Methods
LRC XIII

2nd October 2008

Alfredo Maldonado Guerra, Microsoft European Development Centre

Masaki Itagaki, Microsoft Corporation

Page 2:

About this presentation

Introduction

Internal Consistency Check
Step 1: Mine Source Terms
Step 2: Identify translations of Source Terms (Alignment)
Step 3: Consistency Check

Current Challenges

Tips

Future Improvements

Page 3:

Introduction

Terminology Consistency: A key element of localised language quality

Terminology Consistency: Difficult to maintain
Difficult to keep source and target in sync during the dev/loc process
Translation done by several people (often working remotely)
Terminology changes (e.g. between product versions)

Manual Language Quality Assurance (QA) can help, however:
QA costs time and money
QA usually concentrates on a sample of the text
Reviewer must be familiar with reference material
It’s hard for humans to keep track of terminology

Page 4:

Introduction

Can we use technology to control consistency?

Yes, but…
Existing tools require term lists or term bases
Not all software companies have term bases set up
Companies that do have term bases won’t have every single term captured: building a term base is always a work in progress

Page 5:

Introduction

Our approach doesn’t require a term base

By using Term Mining technology we identify terms in the source strings

We then check the translation consistency of the mined terminology

Page 6:

Internal Consistency Check

[Diagram: the three steps of the check, numbered 1, 2 and 3, with a callout reading “Inconsistency!”]

Page 7:

Step 1: Source Term Mining

Page 8:

Step 2: Translation Alignment

Problem statement:

Given a mined source term S, identify the corresponding target term T in the translation column.

Example:
Mined term: “input field” (S)
Target term: “champ d’entrée” (T)

Page 9:

Step 2: Translation Alignment

We need to consider all possible term combinations

We call each combination an NGram

NGrams: where N = 2, 3, 4, maybe 5. For languages like German we even consider N = 1

How do we decide which NGram is the correct translation for the term?

Bayesian statistics can help!

Example target string:
Réattribue leurs valeurs initiales à tous les champs d'entrée.

Example NGrams (N = 2):
Réattribue leurs
leurs valeurs
valeurs initiales
initiales à
à les

Example NGrams (N = 3):
Réattribue leurs valeurs
leurs valeurs initiales
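A minimal sketch of NGram generation for one target string. It assumes whitespace tokenisation and contiguous word windows only; the function name `ngrams` and its parameters are illustrative and do not come from the presentation.

```python
def ngrams(text, n_values=(2, 3, 4, 5)):
    """Return every contiguous word NGram of the requested sizes."""
    # Assumption: words are separated by whitespace; the final full stop is dropped.
    words = text.rstrip(".").split()
    result = []
    for n in n_values:
        for i in range(len(words) - n + 1):
            result.append(" ".join(words[i:i + n]))
    return result


sentence = "Réattribue leurs valeurs initiales à tous les champs d'entrée."
candidates = ngrams(sentence)
```

Every entry in `candidates` is a possible translation of the mined source term; the scoring step decides which one to keep.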

Page 10:

Step 2: Translation Alignment

Problem statement:

Given a source term S, obtain the NGram T that maximises the conditional probability function

$P(T \mid S)$  [1]

But how do we calculate this?!

Page 11:

Step 2: Translation Alignment

$P(T \mid S)$  [1]

Well, the multiplication rule of conditional probability tells us that

$P(T \cap S) = P(T \mid S)\,P(S)$

So [1] becomes:

$P(T \mid S) = \frac{P(T \cap S)}{P(S)}$  [2]

And we also know that:

[Equation not captured in the transcript; it uses the two counts defined below.]

|NGrams| is the number of NGrams of the same N as T. For example, if T is a 2-word term (a bigram), |NGrams| will be the number of NGrams made up of 2 words.

|STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.

Page 12:

Step 2: Translation Alignment

In our Best Target Term Selection Routine we will be comparing probabilities of different target terms ($T_k$’s):

$P(T_k \mid S) = \frac{P(T_k \cap S)}{P(S)}$

Since P(S) remains constant during these comparisons, we can eliminate it.

We call the resulting equation $I(T_k)$:

$I(T_k) = P(T_k \cap S)$  [3]

The candidate $T_k$ with the highest I is our Best Target Term Candidate
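The selection routine can be sketched as follows. The exact formula for I did not survive in this transcript, so the sketch assumes I(T_k) = |ST_kSeg| / |NGrams|, built from the two counts defined on the previous slide; that estimator, the function name `best_target_term` and the sample segments are all assumptions made for illustration.

```python
from collections import Counter


def best_target_term(source_term, segments, n_values=(2, 3)):
    """Score each target NGram and return (best candidate, all scores)."""
    ngram_totals = Counter()  # |NGrams|: number of NGrams of each size N
    co_counts = Counter()     # |STSeg|: segments with S in source and T in target
    for src, tgt in segments:
        words = tgt.lower().split()
        for n in n_values:
            grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
            ngram_totals[n] += len(grams)
            if source_term in src.lower():
                for gram in grams:
                    co_counts[(n, gram)] += 1
    # Assumed estimator: I(T_k) = |ST_kSeg| / |NGrams of the same N|
    scores = {gram: count / ngram_totals[n] for (n, gram), count in co_counts.items()}
    return max(scores, key=scores.get), scores


# Invented sample data (source string, French translation)
segments = [
    ("Clear the input field", "Effacer le champ d'entrée"),
    ("Empty input field", "Champ d'entrée vide"),
    ("Select an input field", "Sélectionner un champ d'entrée"),
    ("Open the file", "Ouvrir le fichier"),
]
best, scores = best_target_term("input field", segments)
```

On this toy data the bigram shared by all three matching segments scores highest. With real data, generic NGrams can compete with the true translation, so larger corpora help.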

Page 13:

Step 2: Translation Alignment

Normalisation

Depending on context any particular term can be translated in a slightly different way.

For example: “file name” could be translated in Spanish as:

nombre de archivo
nombre del archivo
nombres de archivo
nombres de archivos
nombres de los archivos

Our algorithm has to be clever enough to realise that “nombres de archivo” is just a form of “nombre de archivo”.

Page 14:

Step 2: Translation Alignment

Normalisation

So, during NGram generation, we need to generate regular expressions for our terms

Since Asian languages do not inflect, regular expressions are simpler for these languages

For European languages we use more complex regular expressions

Source Term | Target Term (Italian) | Regular Expression | Matches (admitted translations)
Error code | codice errore | \bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b | codice d'errore, codice di errore, codice errore, codici di errore

Source Term | Target Term (Japanese) | Regular Expression | Matches (admitted translations)
Error code | エラー コード | \bエラー\s?コード\b | エラー コード
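A sketch of how such a regular expression could be generated from a target term. The rule used here (keep the first three letters of each word, allow up to three further word characters, and allow up to two short linking words between stems) is an assumption reverse-engineered from the Italian example above, not the presenters' documented method; `term_to_regex` is a hypothetical name.

```python
import re


def term_to_regex(term, stem_len=3, max_suffix=3):
    """Build a tolerant pattern from a target term (assumed rule, see lead-in)."""
    words = term.split()
    # Each word becomes a fixed stem plus a short, variable ending.
    stems = [re.escape(w[:stem_len]) + r"\w{0,%d}" % max_suffix for w in words]
    # Between stems: up to two short linking words such as "di" or "d'".
    gap = r"(\s\w{1,4}'?){0,2}\s?"
    return r"\b" + gap.join(stems) + r"\b"


pattern = term_to_regex("codice errore")
admitted = ["codice d'errore", "codice di errore", "codice errore", "codici di errore"]
matches = [bool(re.search(pattern, t)) for t in admitted]
```

For “codice errore” this reproduces the Italian pattern in the table, and all four admitted translations match it.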

Page 15:

Step 3: Consistency Check

Detect the strings that do not use any of our admitted translations

Report these strings along with our findings to the user
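The check itself can be sketched as a filter over source/target pairs. The function name `find_inconsistencies` and the sample strings are invented for illustration; the regular expression is the Italian one from the normalisation slide.

```python
import re


def find_inconsistencies(source_term, admitted_regex, segments):
    """Return segments whose source has the term but whose target has no admitted translation."""
    pattern = re.compile(admitted_regex, re.IGNORECASE)
    flagged = []
    for src, tgt in segments:
        if source_term.lower() in src.lower() and not pattern.search(tgt):
            flagged.append((src, tgt))
    return flagged


# Invented sample data (source string, Italian translation)
segments = [
    ("Error code 5 was returned", "È stato restituito il codice di errore 5"),
    ("Unknown error code", "Codice errore sconosciuto"),
    ("Look up the error code", "Cercare il numero dell'errore"),
    ("Restart the service", "Riavviare il servizio"),
]
regex = r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b"
report = find_inconsistencies("error code", regex, segments)
```

Only the third pair is reported: its source contains “error code” but its translation uses none of the admitted forms. A flagged string is a candidate for review and may still be a legitimate rephrasing.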

Page 16:

Current Challenges

False Positives
Due to “heavy” rephrasing

Unreliable for short, generic monograms

Source Term | Admitted translations (Italian)
data | d, d3d, da, dac, dai, dal, dall, data, dati, dato, dc, ddc, dei, del, dell, deny, der, deve, dfs, dhcp, di, dir, disk, dll, dma, dns, dopo, dos, dove, dpc, dsis, dtr, due, dvd, dwm

Page 17:

Current Challenges

Verbs can potentially cause problems
Due to high inflection:
amar => amo, amas, ama, amamos, amáis/aman, aman
venir => vengo, vienes, viene, venimos, venís/vienen, vienen
Difficult to differentiate from other parts of speech

Not all languages supported:
Arabic
Complex Script languages

Source term | Admitted translations | Target Language
download | descarga, descargar, descargó, descargue | Spanish
install | install, installa, installare, installata, installati, installato, installer | Italian

Page 18:

Current Challenges

Best Candidate Selection logic is very good, but it’s not perfect. About 70% of term selections are correct.

[Screenshot: examples of incorrect selections (with the correct term highlighted) and of correct selections]

Page 19:

Tips

Make sure your data is clean to a certain degree.

Remove any HTML/XML tags from your strings

Filter out any unlocalised strings and non-localisable strings.

For Asian languages, run a word breaker tool on your target strings (this is required for proper NGram handling)
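A minimal clean-up sketch for the first three tips. The heuristics are assumptions: tags are stripped with a simple regular expression (not a full HTML/XML parser), a string whose translation is identical to its source is treated as unlocalised, and a source without any letters is treated as non-localisable. Word breaking for Asian languages is not covered.

```python
import re

# Assumption: anything between "<" and ">" is markup.
TAG_RE = re.compile(r"<[^>]+>")


def clean_segments(segments):
    """Strip tags and drop pairs that look unlocalised or non-localisable."""
    cleaned = []
    for src, tgt in segments:
        src = TAG_RE.sub("", src).strip()
        tgt = TAG_RE.sub("", tgt).strip()
        if not src or not tgt:
            continue  # nothing left after tag removal
        if not any(ch.isalpha() for ch in src):
            continue  # assumed non-localisable, e.g. a bare placeholder
        if src == tgt:
            continue  # assumed unlocalised: target is identical to source
        cleaned.append((src, tgt))
    return cleaned


# Invented sample data
raw = [
    ("<b>Save</b> the file", "<b>Enregistrer</b> le fichier"),
    ("OK", "OK"),
    ("%1", "%1"),
    ("", ""),
]
cleaned = clean_segments(raw)
```

The identical-string rule will also drop terms that are legitimately left untranslated, such as product names, so it may need an exception list in practice.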

Page 20:

Tips

If you already have source term lists you’re interested in, you can use them to bypass the term mining process

If your source terms are well selected, you’ll achieve very good results. A well selected source term has a precise technical meaning.

Source term | Good/Bad | Reason
failure | bad | Too generic
data | bad | Too generic, forms part of many other terms: data type, data structure, etc.
worker process | good | Has a precise meaning
user account control | good | Has a precise meaning

Page 21:

Tips

The more data you have, the more accurate your results will be

Try combining software data with help / user education data to increase term repetitions

Page 22:

Future Improvements

More work with Adj + Noun

Work with verbs

Add support for Complex Script languages and languages that inflect on different parts of the word

Further refine Best Translation Candidate Selection logic

Page 23:

Questions?

Page 24:

Thank You!