Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages
Špela Vintar
Faculty of Arts, Dept. of Translation, University of [email protected]
Overview
Term identification in a monolingual context: Some known approaches
Slovene-English setup: Corpora, tools, resources
Multi-word terms: Collocations, nested terms and term variants
Bilingual lexicon extraction and term equivalence
Extracting semantic information
Evaluation & things to improve
Bilingual term extraction: Usual processing sequence
[Diagram: parallel corpus (L1, L2) → term candidates L1, term candidates L2 → finding translation equivalents → bilingual term lexicon]
Corpus
Slovene-English parallel corpus of terminological texts (TRANS), ca. 1 million tokens
– created within a student project at our Department
– aligned with DejaVu (hand-validated)
– on-line concordancing at http://nl2.ijs.si/corpus/index2-bi.html
here: 2 subcorpora
– Nuclear Engineering: 25,000 tokens
– Economic legislation: 166,000 tokens
Linguistic processing
Slovene
– tokenization
– part-of-speech tagging: TnT (Brants 2000); training corpus creation, tagger training & error correction
– lemmatization: Amebis (thanks!) proprietary lemmatization tool; output is non-disambiguated (je → biti, jesti, on); lemma disambiguation through self-made rules
Linguistic processing II
English
– using DFKI tools (thanks!)
– POS-tagging (TnT)
– lemmatization (MMorph)
– chunking (Chunkie)
What is a term: “keywordness”
Measures of keywordness:
– subcorpus vs. general language corpus: relative corpus frequency
– document vs. document collection: tf.idf

$$\mathrm{weight}(i, j) = \left(1 + \log \mathit{tf}_{i,j}\right) \log \frac{N}{\mathit{df}_i}$$

Applied to single or multi-word units.
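A minimal sketch of the tf.idf weight, reading tf as the frequency of term i in document j, df as the number of documents containing the term and N as the size of the document collection. The function name, the natural logarithm and the zero weight for absent terms are assumptions of this sketch, not something the slide specifies.

```python
import math


def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """weight(i, j) = (1 + log(tf_ij)) * log(N / df_i).

    Returns 0.0 when the term does not occur (assumption of this sketch).
    """
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


# A term occurring in every document gets weight 0, however frequent it is.
print(tfidf_weight(tf=5, df=10, n_docs=10))
```

A term that is frequent in one document but rare in the collection scores high; a term spread over all documents scores zero.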
Other indicators of termness
Acronyms (NPP, SG, RBB ...)
Unknown words
– not found in the reference corpus
– unknown to the lemmatizer
Cognates & Named entities
– JE Krško / Krško NPP
– Konzorcij / Consortium
– Siemens/Framatome / Siemens/Framatome
Identifying multi-word units
Collocation extraction techniques
– Mutual Information (Church & Hanks 1990)
– Log-likelihood ratio (Dunning 1993)
– Entropy-based (Shimohata et al. 1997)
– Semantic non-compositionality (Pearce 2001)
According to Daille (1994), log-likelihood (LL) is the most appropriate measure
for n > 3: n-gram frequency (+ stopword filtering) also works
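The log-likelihood ratio for a bigram can be sketched as below, using the standard 2x2 contingency-table form of the G² statistic. The function and parameter names are illustrative, and the counts are assumed to come from a list of all bigrams in the corpus.

```python
import math


def log_likelihood(c12: int, c1: int, c2: int, n: int) -> float:
    """G^2 score for the bigram (w1, w2).

    c12: count of the bigram w1 w2
    c1:  count of bigrams with w1 in first position
    c2:  count of bigrams with w2 in second position
    n:   total number of bigrams
    """
    observed = [
        [c12, c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

When the two words co-occur exactly as often as chance predicts, the score is 0; the score grows as the observed counts depart from the expected ones.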
N-gram term weighting
1. Statistically extracted n-grams are not necessarily terms → need for filtering / weighting
2. Stopword filtering
3. Weighting with tf.idf, ll-rank/core frequency
$$\mathrm{weight}(t_{w_1, w_2, w_3}) = \frac{\mathit{tf.idf}_{w_1} + \mathit{tf.idf}_{w_2} + \mathit{tf.idf}_{w_3}}{n} \cdot \frac{1}{\mathit{rank}}$$
Treatment of nested terms
Local Max of bigram LL-scores
previous steam generator replacements
– previous steam: 34.17
– steam generator: 602.05
– generator replacements: 77.88
steam generator replacement requires
– steam generator: 602.05
– generator replacement: 77.88
– replacement requires: 20.44
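One way to read the local-max idea is sketched below: within an n-gram, keep the bigram whose LL score is higher than that of both neighbouring bigrams. The slide does not spell out the exact procedure, so the function and its handling of n-gram edges are assumptions; the scores are the ones from the two example phrases.

```python
def local_max_bigrams(tokens, scores):
    """Return the bigrams whose score exceeds both neighbouring bigram scores.

    scores[i] is the LL score of the bigram (tokens[i], tokens[i + 1]).
    A missing neighbour at either edge counts as minus infinity.
    """
    if len(scores) != len(tokens) - 1:
        raise ValueError("need exactly one score per adjacent token pair")
    peaks = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if score > left and score > right:
            peaks.append((tokens[i], tokens[i + 1]))
    return peaks


print(local_max_bigrams(
    ["previous", "steam", "generator", "replacements"],
    [34.17, 602.05, 77.88],
))
```

For both example phrases the only peak is "steam generator", which is the nested term the scores point to.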
C-value (Frantzi & Ananiadou 1996)
C-value(a) = (length(a) – 1) (freq(a) – t(a)/c(a))
n-gram                    C-value
compressive force         10.3
axial compressive         5.2
axial compressive force   16.4
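A small sketch of the C-value formula, assuming t(a) is the frequency of a inside longer candidate terms and c(a) the number of those longer candidates, and that the correction term is dropped when a is not nested. The frequencies in the example are hypothetical; they are not the counts behind the table above.

```python
def c_value(length: int, freq: int, nested_freq: int = 0, n_longer: int = 0) -> float:
    """C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a)).

    nested_freq is t(a), n_longer is c(a). With n_longer == 0 the candidate
    is not nested and only (length - 1) * freq remains.
    """
    if n_longer == 0:
        return float((length - 1) * freq)
    return (length - 1) * (freq - nested_freq / n_longer)


# Hypothetical counts: a 2-gram seen 10 times, 6 of them inside 2 longer candidates.
print(c_value(length=2, freq=10, nested_freq=6, n_longer=2))
```

The measure rewards longer candidates and penalises a candidate that mostly occurs as part of longer ones.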
Extracting multi-word terms: Syntactic patterns
Extraction of terminologically relevant part-of-speech patterns (applied as regular expressions or finite state automata) (Heid 1998, 2001; Bourigault 1996; Jacquemin 2001)
Patterns enable extraction of single occurrences
Patterns facilitate treatment of term variation
– replacement of steam generator = steam generator replacement
– NN1 of NN2 NN3 = NN2 NN3 NN1
Patterns facilitate treatment of nesting
– head of phrase may be easily established
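The pattern idea can be sketched with a regular expression over part-of-speech tags. The tag names (Penn-style JJ, NN, NNS), the single pattern "adjectives/nouns followed by a noun" and both helper functions are assumptions for illustration; the actual pattern grammar is not shown on the slides.

```python
import re

# Hypothetical mapping from POS tags to one-letter codes, so that
# regex match offsets correspond directly to token positions.
TAG_CODES = {"JJ": "A", "NN": "N", "NNS": "N"}


def extract_pattern_instances(tagged):
    """Return multi-word units matching (adjective|noun)+ noun."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"[AN]+N", codes)
    ]


def normalize_of_variant(tokens):
    """Rewrite 'NN1 of NN2 NN3' as 'NN2 NN3 NN1'; other input is returned unchanged."""
    if "of" in tokens:
        i = tokens.index("of")
        return tokens[i + 1:] + tokens[:i]
    return list(tokens)


sentence = [("the", "DT"), ("previous", "JJ"), ("steam", "NN"),
            ("generator", "NN"), ("replacement", "NN"), ("requires", "VBZ")]
print(extract_pattern_instances(sentence))
print(normalize_of_variant(["replacement", "of", "steam", "generator"]))
```

Because the pattern works on a single occurrence, it does not need the frequency evidence that the statistical measures rely on, and the last noun of a match gives the head of the phrase.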
Bilingual lexicon extraction
word-alignment tools: Twente (Hiemstra 1998), Egypt/Giza (Och 2000), PLUG (Tiedemann 1999) etc.
comparison planned; currently using Twente
– based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
– outputs translation candidates + scores for each word in the corpus; both ways
– using stopword-filtered corpora to improve results
Output of Twente lexicon extraction
sprejeti            sprejetje           sprememba           spremeniti
------------------  ------------------  ------------------  ------------------
adopted 0.45        adoption 0.94       amendments 0.54     amended 0.38
approved 0.33       responsibilit 0.06  changes 0.21        will 0.17
adoption 0.11                           amendment 0.14      Health 0.16
approval 0.10                           Act 0.03            amending 0.03
                                        Harmonized 0.02     evidence 0.03
                                        devices 0.02        supplementing 0.03
                                        medical 0.02        short 0.03
                                        responsibilit 0.01  awaiting 0.03

spremljajocx        spremljanje         spricxevalo         sprostiti
------------------  ------------------  ------------------  ------------------
accompanying 0.47   monitoring 1.00     referral 0.16       adapted 0.27
responsibilit 0.16                      issue 0.11          equestrian 0.27
Institutions 0.16                       attached 0.11       events 0.27
800 0.07                                changed 0.11        there 0.18
regulates 0.05                          veterinarians 0.11  free 0.01
cost 0.03                               attestations 0.11
work 0.03                               appointed 0.11
begin 0.02                              emergency 0.08
Extraction of cognates
string comparison on the level of types in two parallel segments: Perl module String::Approx (Hietaniemi 2002)
high-precision cognates override the bilingual lexicon
term relevance!
Nr. of extracted cognate pairs: NE 364, EL 776
informatika        informatics
infrastrukture     infrastructure
instrumentacija    instrumentation
integracija        integrating
integrala          integral
iterativen         iterative
karakteristik      characteristics
kaskade            cascade
koeficient         coefficient
komponenta         component
koncentracijo      concentration
koncept            concept
konstanta          constant
konvergenca        convergence
koordinat          coordinates
linearne           linear
logisticxna        logistic
materiali          materials
matrika            Matrix
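Cognate detection by approximate string comparison can be sketched as follows. This is a Python stand-in using plain Levenshtein distance, not the Perl String::Approx module used in the project, and the 0.7 similarity threshold is a hypothetical value chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def is_cognate(sl_word: str, en_word: str, threshold: float = 0.7) -> bool:
    """True if the normalised similarity of the two word types reaches the threshold."""
    sl_word, en_word = sl_word.lower(), en_word.lower()
    if not sl_word or not en_word:
        return False
    distance = levenshtein(sl_word, en_word)
    similarity = 1.0 - distance / max(len(sl_word), len(en_word))
    return similarity >= threshold


print(is_cognate("koeficient", "coefficient"))
```

In the setup described above, the comparison would run over the word types of each aligned segment pair, which keeps the number of comparisons small and the precision high.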
Term alignment
for each source term candidate we collect all single-word equivalents from the bilingual lexicon
projekt zamenjave uparjalnikov
– projekt: project 1.00
– zamenjave: [null] 0.00
– uparjalnikov: steam 0.49, generator 0.33, generators 0.18
among extracted target terms we choose the one with the highest match of words
scores are added up into an equivalence score
steam generator replacement project 1.82
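The alignment step can be sketched with the lexicon entries from the example: for every word of the source term, the scores of those equivalents that appear in a target term candidate are added up, and the candidate with the highest total wins. The function names and the list of competing candidates are illustrative assumptions.

```python
# Single-word equivalents for the example term, as listed in the bilingual lexicon.
LEXICON = {
    "projekt": {"project": 1.00},
    "zamenjave": {"[null]": 0.00},
    "uparjalnikov": {"steam": 0.49, "generator": 0.33, "generators": 0.18},
}


def equivalence_score(source_term, target_term, lexicon):
    """Sum the lexicon scores of all source-word equivalents found in the target term."""
    target_words = set(target_term.split())
    score = 0.0
    for word in source_term.split():
        for equivalent, prob in lexicon.get(word, {}).items():
            if equivalent in target_words:
                score += prob
    return score


def best_equivalent(source_term, candidates, lexicon):
    """Pick the target term candidate with the highest equivalence score."""
    return max(candidates, key=lambda t: equivalence_score(source_term, t, lexicon))


source = "projekt zamenjave uparjalnikov"
candidates = ["replacement project", "steam generator replacement project",
              "nuclear power plant"]
print(best_equivalent(source, candidates, LEXICON))
```

For "steam generator replacement project" the sum is 1.00 + 0.49 + 0.33 = 1.82, the equivalence score shown on the slide.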
Bilingual term extraction: Statistical model
[Diagram: parallel corpus (L1, L2) → in each language: single-word terms (tf.idf, cognates, unknown words) and contiguous n-grams (2-4) → log-likelihood, stopword filtering, collapsing nesting, term weighting → multi-word terms; bilingual lexicon and cognate pairs → term alignment → bilingual term candidates]
Bilingual term extraction: Pattern-based model
[Diagram: tagged & lemmatized parallel corpus (L1, L2) → in each language: single-word terms (nouns only; tf.idf of lemmas, cognates) and multi-word pattern instances (pattern grammar) → stopword filtering, term weighting → multi-word terms; bilingual lexicon and cognate pairs → term alignment → bilingual term candidates]
<document id="Cerjak.al">
 <tu id="Cerjak.2">
  <terms>
   <slterm string="jedrska elektrarna" tokens="1 2" score="5.677" />
   <slterm string="zamenjavi uparjalnikov" tokens="13 14" score="15.424" />
   <slterm string="modernizacije elektrarne" tokens="21 22" score="12.956" />
   <slterm string="jedrska elektrarna krsxko" tokens="1 2 3" score="7.7613" />
   <slterm string="projektov modernizacije elektrarne" tokens="20 21 22" score="10.468" />
   <slterm string="izmed projektov modernizacije elektrarne" tokens="19 20 21 22" score="8.719" />
   <enterm string="nuclear power" tokens="3 4" score="10.754" />
   <enterm string="power plant" tokens="4 5" score="11.620" />
   <enterm string="replacement project" tokens="15 16" score="9.537" />
   <enterm string="modernization projects" tokens="23 24" score="9.178" />
   <enterm string="consortium siemens" tokens="30 31" score="7.326" />
   <enterm string="nuclear power plant" tokens="3 4 5" score="12.497" />
   <enterm string="krsxko nuclear power plant" tokens="2 3 4 5" score="13.064" />
   <equiv string="jedrska elektrarna krsxko" equiv="krsxko nuclear power plant" transscore="1.43" />
   <equiv string="jedrska elektrarna" equiv="nuclear power plant" transscore="0.74" />
   <equiv string="projektov modernizacije" equiv="modernization projects" transscore="1.84" />
  </terms>
  <seg lang="SL"> ....
  <seg lang="EN"> ....
 </tu>
</document>
High precision term pairs
Slovene term                       score    English term                  transscore   score
padec tlaka                        9.624    reduce pressure               1.09         13.599
opticxnih meritvah                 7.635    Optical measurement           1.12         11.966
Opticxne meritve                   9.484    Optical survey                1.9          10.264
nominalno tlacxno                  8.413    nominal pressure              1.8          13.710
natancxno nacxrtovanje             9.112    detailed planning             1.11         4.428
glavne komponente                  9.869    main components               1.44         9.237
francisovi turbini                 7.040    Francis turbine               2            10.430
dvizxnim sistemom                  8.003    sliding System                0.78         8.505
drsnim sistemom                    8.003    sliding System                0.77         8.505
Delovni paket                      8.214    work package                  1.36         7.189
cevovodih sistema                  7.455    piping System                 1.57         13.206
blokirani legi                     7.001    blocked position              1.95         5.728
turbulentne kineticxne energije    6.998    turbulent kinetic energy      1.87         8.916
razmerja FG FGcr                   8.048    ratia FG FGcr                 2.51         9.808
razlicxnimi turbulentnimi modeli   7.530    different turbulence models   2.79         10.290
Evaluation & Results
Evaluation data
Slovene: one hand-tagged document (by a group of translation students, not domain experts!) – 181 terms (including nestings)
Pattern-based term-tagging correctly detects 71 (precision x, recall 39.2%)
Reasons for missed terms:
– term length > 4
– term variation (low tf.idf)
– automatic filtering of nestings too rigid
– tagging/lemmatization mistakes (pattern not extracted)
Fine-tuning of the weighting scheme needed (currently set to highest possible precision)
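The recall figure follows directly from the two counts given: 71 correctly detected terms out of 181 hand-tagged ones. A minimal check, with a helper name chosen for this sketch; precision is left out because the number of extracted candidates is not given.

```python
def recall(correct: int, gold_total: int) -> float:
    """Share of hand-tagged terms that the extractor found."""
    return correct / gold_total


print(f"{recall(71, 181):.1%}")  # prints 39.2%
```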