Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation

Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages

Špela Vintar

Faculty of Arts
Dept. of Translation
University of [email protected]

Overview

Term identification in a monolingual context: Some known approaches
Slovene-English setup: Corpora, tools, resources
Multi-word terms: Collocations, nested terms and term variants
Bilingual lexicon extraction and term equivalence
Extracting semantic information
Evaluation & things to improve

Bilingual term extraction: Usual processing sequence

[Diagram: parallel corpus (L1, L2) → term candidates L1, term candidates L2 → finding translation equivalents → bilingual term lexicon]

Corpus

Slovene-English parallel corpus of terminological texts (TRANS), ca. 1 million tokens; created within a student project at our Department; aligned with DejaVu (hand-validated); on-line concordancing at http://nl2.ijs.si/corpus/index2-bi.html

here: 2 subcorpora
– Nuclear Engineering: 25,000 tokens
– Economic legislation: 166,000 tokens


Linguistic processing

Slovene
– tokenization
– part-of-speech tagging: TnT (Brants 2000); training corpus creation, tagger training & error correction
– lemmatization: Amebis (thanks!) proprietary lemmatization tool; non-disambiguated: je → biti, jesti, on; lemma disambiguation through self-made rules

Linguistic processing II

English
– using DFKI tools (thanks!)
– POS-tagging (TnT)
– lemmatization (MMorph)
– chunking (Chunkie)

What is a term: “keywordness”

Measures of keywordness:
– subcorpus vs. general language corpus: relative corpus frequency
– document vs. document collection: tf.idf

Applied to single or multi-word units.

$$\mathrm{weight}(i, j) = \left(1 + \log tf_{i,j}\right) \log \frac{N}{df_i}$$
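As a minimal sketch of the tf.idf weighting above, the snippet below reads the formula as (1 + log tf) · log(N / df), with tf the frequency of a term in one document, df the number of documents containing it, and N the number of documents. The function name `tfidf_weights` and the toy documents are illustrative assumptions, not part of the original setup.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # df: number of documents in which each term occurs at least once
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        })
    return weights

# Toy example (made-up documents)
docs = [["steam", "generator", "steam"], ["generator", "law"]]
w = tfidf_weights(docs)
```

A term that occurs in every document ("generator" here) gets weight 0, which is the behaviour one wants from a keywordness measure.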

Other indicators of termness

Acronyms (NPP, SG, RBB ...)
Unknown words
– not found in the reference corpus
– unknown to the lemmatizer
Cognates & Named entities

JE Krško             Krško NPP
Konzorcij            Consortium
Siemens/Framatome    Siemens/Framatome

Identifying multi-word units

Collocation extraction techniques
– Mutual Information (Church & Hanks 1990)
– Log-likelihood ratio (Dunning 1993)
– Entropy-based (Shimohata et al. 1997)
– Semantic non-compositionality (Pearce 2001)

According to Daille (1994), LL is the most appropriate measure

for n > 3: n-gram frequency (+ stopword filtering) also works
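To make the log-likelihood measure concrete, here is a small sketch of the G² statistic for a bigram, computed from a 2×2 contingency table in the usual way. The function name and argument layout are assumptions made for illustration; the counts would come from the corpus.

```python
import math

def log_likelihood(c12, c1, c2, n):
    """G2 score for the bigram (w1, w2).

    c12: count of "w1 w2"; c1: count of bigrams starting with w1;
    c2: count of bigrams ending with w2; n: total number of bigrams.
    """
    observed = [
        [c12, c1 - c12],                  # w1 followed by w2 / by another word
        [c2 - c12, n - c1 - c2 + c12],    # another word followed by w2 / by another word
    ]
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2 * total
```

When the observed bigram count equals what independence predicts, the score is 0; the further the words are from independence, the higher the score.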

N-gram term weighting

1. Statistically extracted n-grams are not necessarily terms → need for filtering / weighting

2. Stopword filtering

3. Weighting with tf.idf, ll-rank/core frequency

$$\mathrm{weight}(t_{w_1, w_2, w_3}) = \frac{tf.idf_{w_1} + tf.idf_{w_2} + tf.idf_{w_3}}{n} \cdot \frac{1}{\mathrm{rank}}$$

Treatment of nested terms

Local Max of bigram LL-scores

previous steam generator replacements

previous steam    steam generator    generator replacements
34.17             602.05             77.88

steam generator replacement requires

steam generator   generator replacement   replacement requires
602.05            77.88                   20.44
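The slide does not spell out the local-maximum procedure, so the following is only one possible reading: a bigram is kept as a term nucleus if its LL-score is higher than the scores of the bigrams immediately to its left and right. The function name is hypothetical; the scores are those shown above.

```python
def local_max_bigrams(tokens, scores):
    """tokens: n words; scores: n-1 LL-scores of the adjacent bigrams.

    Returns the bigrams whose score is strictly higher than both neighbours
    (a missing neighbour at the edge counts as minus infinity).
    """
    peaks = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if score > left and score > right:
            peaks.append((tokens[i], tokens[i + 1]))
    return peaks

tokens = "previous steam generator replacements".split()
peaks = local_max_bigrams(tokens, [34.17, 602.05, 77.88])
```

Under this reading, both example sequences yield "steam generator" as the nested unit.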


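Local Max of bigram LL-scores
C-value (Frantzi & Ananiadou 1996)

$$\text{C-value}(a) = (\mathrm{length}(a) - 1)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

where t(a) is the frequency of a inside longer candidate terms and c(a) is the number of those longer candidates.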

n-gram                     C-value
compressive force          10.3
axial compressive          5.2
axial compressive force    16.4
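A simplified sketch of the C-value computation, assuming t(a) is the summed frequency of the longer candidates that contain a and c(a) is the number of such candidates; for a candidate that is not nested anywhere, the second factor reduces to freq(a). The frequencies below are invented for illustration and are not the ones behind the values in the table.

```python
def is_nested(shorter, longer):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_values(freq):
    """freq: {tuple of words: corpus frequency}. Returns {candidate: C-value}."""
    result = {}
    for a, f in freq.items():
        longer = [b for b in freq if is_nested(a, b)]
        if longer:
            t = sum(freq[b] for b in longer)   # t(a): frequency inside longer candidates
            c = len(longer)                    # c(a): number of longer candidates
            result[a] = (len(a) - 1) * (f - t / c)
        else:
            result[a] = (len(a) - 1) * f
    return result

# Hypothetical frequencies
freq = {
    ("compressive", "force"): 12,
    ("axial", "compressive"): 8,
    ("axial", "compressive", "force"): 6,
}
scores = c_values(freq)
```

With these made-up counts the trigram scores highest, and "axial compressive", which rarely occurs outside the trigram, scores lowest, which is the same ordering as in the table.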

Extracting multi-word terms: Syntactic patterns

Extraction of terminologically relevant part-of-speech patterns (applied as regular expressions or finite state automata) (Heid 1998, 2001; Bourigault 1996; Jacquemin 2001)

Patterns enable extraction of single occurrences
Patterns facilitate treatment of term variation
(replacement of steam generator = steam generator replacement)
NN1 of NN2 NN3 = NN2 NN3 NN1
Patterns facilitate treatment of nesting – head of phrase may be easily established
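The pattern approach can be sketched with a regular expression over part-of-speech tags. This is an illustration only: it assumes Penn-style tags (NN*, JJ*), a single toy pattern, and hypothetical helper names; the actual pattern grammar used for Slovene and English is not shown in the slides.

```python
import re

def tag_code(word, tag):
    """Map each token to one character so regex offsets equal token indices."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "J"
    if word.lower() == "of":
        return "P"
    return "O"

# (adjective|noun)* noun, optionally followed by "of" (adjective|noun)* noun
TERM_PATTERN = re.compile(r"[JN]*N(?:P[JN]*N)?")

def extract_pattern_instances(tagged):
    """tagged: list of (word, tag). Returns multi-word pattern instances."""
    codes = "".join(tag_code(w, t) for w, t in tagged)
    instances = []
    for m in TERM_PATTERN.finditer(codes):
        if m.end() - m.start() >= 2:  # keep multi-word matches only
            instances.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return instances

def normalize_of_variant(term):
    """Rewrite 'NN1 of NN2 NN3' as 'NN2 NN3 NN1'."""
    words = term.split()
    if "of" in words:
        i = words.index("of")
        return " ".join(words[i + 1:] + words[:i])
    return term

tagged = [("the", "DT"), ("replacement", "NN"), ("of", "IN"),
          ("steam", "NN"), ("generator", "NN"), ("requires", "VBZ")]
found = extract_pattern_instances(tagged)
```

Because a pattern matches a single occurrence, no frequency threshold is needed, and the normalization step maps both variants of the example term to the same string.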

Bilingual lexicon extraction

word-alignment tools: Twente (Hiemstra 1998), Egypt/Giza (Och 2000), PLUG (Tiedemann 1999) etc.
comparison planned; currently using Twente
– based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
– outputs translation candidates + scores for each word in the corpus; both ways
– using stopword-filtered corpora to improve results

Output of Twente lexicon extraction

sprejeti            sprejetje           sprememba           spremeniti
------------------  ------------------  ------------------  ------------------
adopted 0.45        adoption 0.94       amendments 0.54     amended 0.38
approved 0.33       responsibilit 0.06  changes 0.21        will 0.17
adoption 0.11                           amendment 0.14      Health 0.16
approval 0.10       Act 0.03            amending 0.03       Harmonized 0.02
                                        evidence 0.03       devices 0.02
                                        supplementing 0.03  medical 0.02
                                        short 0.03          responsibilit 0.01
                                        awaiting 0.03

spremljajocx        spremljanje         spricxevalo         sprostiti
------------------  ------------------  ------------------  ------------------
accompanying 0.47   monitoring 1.00     referral 0.16       adapted 0.27
responsibilit 0.16                      issue 0.11          equestrian 0.27
Institutions 0.16                       attached 0.11       events 0.27
800 0.07                                changed 0.11        there 0.18
regulates 0.05                          veterinarians 0.11  free 0.01
cost 0.03                               attestations 0.11
work 0.03                               appointed 0.11
begin 0.02                              emergency 0.08

Extraction of cognates

string comparison on the level of types in two parallel segments: Perl module String::Approx (Hietaniemi 2002)

high precision cognates override bilingual lexicon

term relevance!
Nr. of extracted cognate pairs: NE 364, EL 776

informatika        informatics
infrastrukture     infrastructure
instrumentacija    instrumentation
integracija        integrating
integrala          integral
iterativen         iterative
karakteristik      characteristics
kaskade            cascade
koeficient         coefficient
komponenta         component
koncentracijo      concentration
koncept            concept
konstanta          constant
konvergenca        convergence
koordinat          coordinates
linearne           linear
logisticxna        logistic
materiali          materials
matrika            Matrix
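Cognate extraction by approximate string matching can be sketched as follows. The original work used the Perl module String::Approx; this illustration swaps in Python's standard `difflib.SequenceMatcher` instead, and the threshold, minimum word length, and function name are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def cognate_pairs(source_tokens, target_tokens, threshold=0.75, min_len=5):
    """Pair word types from two parallel segments by string similarity.

    For each source type, keep the most similar target type if its
    similarity ratio reaches the (assumed) threshold.
    """
    targets = sorted(t for t in set(target_tokens) if len(t) >= min_len)
    pairs = []
    for s in sorted(set(source_tokens)):
        if len(s) < min_len:
            continue
        best, best_score = None, 0.0
        for t in targets:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, round(best_score, 2)))
    return pairs

pairs = cognate_pairs(
    ["koeficient", "konvergenca", "je", "tlaka"],
    ["coefficient", "convergence", "pressure", "the"],
)
```

Comparing types only within aligned segments keeps the number of comparisons small and is what makes high precision achievable with a purely orthographic measure.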

Term alignment

for each source term candidate we collect all single-word equivalents from the bilingual lexicon

projekt          zamenjave        uparjalnikov
project 1.00     [null] 0.00      steam 0.49
                                  generator 0.33
                                  generators 0.18

among extracted target terms we choose the one with highest match of words

scores are added up into equivalence score

steam generator replacement project 1.82
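The alignment step can be illustrated with the example from this slide. In the sketch below, the equivalence score of a target candidate is the sum of the lexicon scores of all single-word equivalents that occur in it; the function names are hypothetical and the list of competing target candidates is invented for the example.

```python
def equivalence_score(source_term, target_term, lexicon):
    """Sum the lexicon scores of all source-word equivalents found in the target term."""
    target_words = set(target_term.split())
    return sum(
        score
        for word in source_term.split()
        for translation, score in lexicon.get(word, {}).items()
        if translation in target_words
    )

def best_equivalent(source_term, target_candidates, lexicon):
    """Choose the target candidate with the highest equivalence score."""
    return max(target_candidates,
               key=lambda t: equivalence_score(source_term, t, lexicon))

# Lexicon entries as shown on the slide
lexicon = {
    "projekt": {"project": 1.00},
    "zamenjave": {"[null]": 0.00},
    "uparjalnikov": {"steam": 0.49, "generator": 0.33, "generators": 0.18},
}
source = "projekt zamenjave uparjalnikov"
candidates = ["steam generator", "replacement project",
              "steam generator replacement project"]
best = best_equivalent(source, candidates, lexicon)
```

For "steam generator replacement project" the matching equivalents are project (1.00), steam (0.49) and generator (0.33), which add up to the 1.82 shown above.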

Bilingual term extraction: Statistical model

[Diagram: processing pipeline, applied to both sides (L1, L2) of the parallel corpus]
single-word terms: tf.idf, cognates, unknown words
contiguous n-grams (2-4): log-likelihood, stopword filtering, collapsing nesting, term weighting → multi-word terms
bilingual lexicon + cognate pairs → term alignment → bilingual term candidates

Bilingual term extraction: Pattern-based model

[Diagram: processing pipeline, applied to both sides (L1, L2) of the tagged & lemmatized parallel corpus]
single-word terms: nouns only; tf.idf of lemmas, cognates
multi-word pattern instances: pattern grammar, stopword filtering, term weighting → multi-word terms
bilingual lexicon + cognate pairs → term alignment → bilingual term candidates

<document id="Cerjak.al">
  <tu id="Cerjak.2">
    <terms>
      <slterm string="jedrska elektrarna" tokens="1 2" score="5.677" />
      <slterm string="zamenjavi uparjalnikov" tokens="13 14" score="15.424" />
      <slterm string="modernizacije elektrarne" tokens="21 22" score="12.956" />
      <slterm string="jedrska elektrarna krsxko" tokens="1 2 3" score="7.7613" />
      <slterm string="projektov modernizacije elektrarne" tokens="20 21 22" score="10.468" />
      <slterm string="izmed projektov modernizacije elektrarne" tokens="19 20 21 22" score="8.719" />
      <enterm string="nuclear power" tokens="3 4" score="10.754" />
      <enterm string="power plant" tokens="4 5" score="11.620" />
      <enterm string="replacement project" tokens="15 16" score="9.537" />
      <enterm string="modernization projects" tokens="23 24" score="9.178" />
      <enterm string="consortium siemens" tokens="30 31" score="7.326" />
      <enterm string="nuclear power plant" tokens="3 4 5" score="12.497" />
      <enterm string="krsxko nuclear power plant" tokens="2 3 4 5" score="13.064" />
      <equiv string="jedrska elektrarna krsxko" equiv="krsxko nuclear power plant" transscore="1.43" />
      <equiv string="jedrska elektrarna" equiv="nuclear power plant" transscore="0.74" />
      <equiv string="projektov modernizacije" equiv="modernization projects" transscore="1.84" />
    </terms>
    <seg lang="SL"> ....
    <seg lang="EN"> ....
  </tu>
</document>

High precision term pairs

Slovene term                      score   English term                 transscore  score
padec tlaka                       9.624   reduce pressure              1.09        13.599
opticxnih meritvah                7.635   Optical measurement          1.12        11.966
Opticxne meritve                  9.484   Optical survey               1.9         10.264
nominalno tlacxno                 8.413   nominal pressure             1.8         13.710
natancxno nacxrtovanje            9.112   detailed planning            1.11        4.428
glavne komponente                 9.869   main components              1.44        9.237
francisovi turbini                7.040   Francis turbine              2           10.430
dvizxnim sistemom                 8.003   sliding System               0.78        8.505
drsnim sistemom                   8.003   sliding System               0.77        8.505
Delovni paket                     8.214   work package                 1.36        7.189
cevovodih sistema                 7.455   piping System                1.57        13.206
blokirani legi                    7.001   blocked position             1.95        5.728
turbulentne kineticxne energije   6.998   turbulent kinetic energy     1.87        8.916
razmerja FG FGcr                  8.048   ratia FG FGcr                2.51        9.808
razlicxnimi turbulentnimi modeli  7.530   different turbulence models  2.79        10.290

Evaluation & Results

Evaluation data
Slovene: one hand-tagged document (by a group of translation students, not domain experts!) – 181 terms (including nestings)

Pattern-based term-tagging correctly detects 71 (precision x, recall 39.2%)

Reasons for missed terms:
– term length > 4
– term variation (low tf.idf)
– automatic filtering of nestings too rigid
– tagging/lemmatization mistakes (pattern not extracted)

Fine-tuning of the weighting scheme needed (currently set to highest possible precision)
