Managing Morphologically Complex Languages in Information Retrieval
Kal Järvelin & Many Others
University of Tampere
1. Introduction
- Morphologically complex languages (unlike English, Chinese): rich inflectional and derivational morphology, rich compound formation
- U. Tampere experiences 1998-2008: monolingual IR and cross-language IR; focus on Finnish, Germanic languages, English
Methods for Morphology Variation Management
[Figure: taxonomy of methods. Reductive methods: stemming (rule-based) and lemmatization (rules + dictionary). Generative methods: inflectional stem generation (rules + dictionary; enhanced inflectional stems) and word form generation (generating all forms, or FCG).]
Agenda
1. Introduction  2. Reductive Methods  3. Compounds  4. Generative Methods  5. Query Structures  6. OOV Words  7. Conclusion
2. Normalization
- Reductive methods, conflation: stemming, lemmatization
- + conflation -> simpler searching; + smaller index; + provides query expansion
- Stemming available for many languages (e.g. Porter stemmer)
- Lemmatizers less available and more demanding (dictionary requirement)
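The contrast between the two reductive approaches can be sketched in a few lines. This is a toy illustration, not a real stemmer or lemmatizer: the suffix list and the mini lemma dictionary are invented for this example.

```python
# Toy contrast between rule-only stemming and dictionary-backed lemmatization.
# SUFFIXES and LEMMA_DICT are illustrative inventions, not real resources.
SUFFIXES = ["ings", "ing", "ers", "er", "s"]   # crude, Porter-like strip list
LEMMA_DICT = {"trains": "train", "training": "training", "better": "good"}

def toy_stem(word):
    """Reductive, rule-only: strip the longest matching suffix."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def toy_lemmatize(word):
    """Dictionary-backed: exact lookup, falling back to the surface form."""
    return LEMMA_DICT.get(word, word)

print(toy_stem("trainers"))     # train
print(toy_lemmatize("better"))  # good
```

The asymmetry in the slide is visible even here: the stemmer needs only rules, while the lemmatizer needs a dictionary entry for every form it is to handle.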
Alkula 2001
- Boolean environment, inflected index, Finnish: manual truncation vs. automatic stemming; stemming improves P and hurts R; many derivatives are lost
- Boolean environment, inflected vs. lemma index, Finnish: manual truncation vs. lemmatization; lemmatization improves P and hurts R; many derivatives are lost, others correctly avoided
- Differences between the automatic methods are not great
Kettunen & al 2005
Ranked retrieval, Finnish. Three questions:
- How do lemmatization and inflectional stem generation compare in a best-match environment?
- Is a stemmer realistic for handling Finnish morphology?
- How feasible is simulated truncation in a best-match system?
Lemmatized vs. inflected-form vs. stemmed index.
Kettunen & al. 2005

Method        Index     MAP   Change %
FinTWOL       lemmas    35.0  --
Inf Stem Gen  inflform  34.2  -2.3
Porter        stemmed   27.7  -20.9
Raw           inflform  18.9  -46.0

But: very long queries for inflectional stem generation & expansion (thousands of words); weaker generations are shorter but give progressively deteriorating results.
(InQuery/TUTK/graded-35/regular)
Kettunen & al. 2005
[Figure not recoverable]
MonoIR: Airio 2006

Language  Index type  Average P %  Diff to baseline %
English   Inflected   43.4         --
          Lemmas      45.6         +2.2
          Stemmed     46.3         +2.9
Finnish   Inflected   31.0         --
          Lemmas      47.0         +16.0
          Stemmed     48.5         +17.5
Swedish   Inflected   30.2         --
          Lemmas      31.4         +1.2
          Stemmed     33.5         +3.3
German    Inflected   30.2         --
          Lemmas      31.9         +1.7
          Stemmed     35.7         +5.5

(InQuery/CLEF/TD/TWOL&Porter&Raw)
CLIR: Inflectional Morphology
- NL queries contain inflected-form source keys
- Dictionary headwords are in basic form (lemmas)
- Problem significance varies by language
- Stemming: stem both the dictionary and the query words, but this may produce far too many translations
- Stemming in dictionary translation is best applied after translation
Lemmatization in CLIR
- Lemmatization: easy access to dictionaries, but tokens may be ambiguous
- Dictionary translations are not always in basic form
- Lemmatizer's dictionary coverage: if insufficient -> non-lemmatized source keys, OOVs; if too broad -> too many senses provided
CLIR Findings: Airio 2006
English -> X

Target X  Index type  Average P %  Diff to split %
Finnish   Lemmas      29.0         -6.5
          Stemmed     20.8         -14.7
Swedish   Lemmas      17.4         -9.7
          Stemmed     19.0         -8.1
German    Lemmas      26.4         -4.6
          Stemmed     25.7         -5.3

(InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter)
3. Compounds
- Compound word types:
  - determinative: Weinkeller, vinkällare, life-jacket
  - copulative: schwarzweiss, svartvit, black-and-white
  - compositional: Stadtverwaltung, stadsförvaltning
  - non-compositional: Erdbeere, jordgubbe, strawberry
- Note on spelling: compound word components are spelled together (if not -> phrases)
Compound Word Translation
- Not all compounds are in the dictionary; some languages are very productive
- Small dictionaries: atomic words, old non-compositional compounds
- Large dictionaries: many compositional compounds added
- Compounds remove phrase identification problems, but cause translation and query formulation problems
Joining Morphemes
- Joining morphemes complicate compound analysis & translation
- Joining morpheme types in Swedish:
  - <omission>: flicknamn
  - -s: rättsfall
  - -e: flickebarn
  - -a: gästabud
  - -u: gatubelysning
  - -o: människokärlek
- Joining morpheme types in German:
  - -s: Handelsvertrag
  - -n: Affenhaus
  - -e: Gästebett
  - -en: Fotographenausbildung
  - -er: Gespensterhaus
  - -es: Freundeskreis
  - -ens: Herzensbrecher
  - <omission>: Sprachwissenschaft
- Suggestive finding: treatment of joining morphemes improves MAP by 2% (Hedlund 2002, SWE->ENG, 11 queries)
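A joining-morpheme-aware compound splitter can be sketched as below. The vocabulary and joiner list are tiny illustrative samples, not real resources; a real splitter needs a full lexicon and must also handle stem alternation (e.g. Swedish gata -> gat-u- in gatubelysning), which this sketch does not.

```python
# Sketch of joining-morpheme-aware compound splitting.
# VOCAB and JOINERS are illustrative samples only.
VOCAB = {"rätt", "fall", "handel", "vertrag", "gäst", "bett"}
JOINERS = ["s", "e", "en"]   # subset of the Swedish/German joiners listed above

def split_compound(word):
    """Return (head, joiner, tail) splits whose parts are in VOCAB."""
    word = word.lower()
    splits = []
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if tail not in VOCAB:
            continue
        if head in VOCAB:
            splits.append((head, "", tail))
        else:
            # head may be a vocabulary word plus a joining morpheme
            for j in JOINERS:
                if head.endswith(j) and head[: -len(j)] in VOCAB:
                    splits.append((head[: -len(j)], j, tail))
    return splits

print(split_compound("Handelsvertrag"))  # [('handel', 's', 'vertrag')]
print(split_compound("rättsfall"))       # [('rätt', 's', 'fall')]
```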
Compound Processing, 2
- A Finnish natural language query: lääkkeet sydänvaivoihin (medicines for heart problems)
- The output of morphological analysis: lääke; sydänvaiva, sydän, vaiva
- Dictionary translation and the output of component tagging:
  - lääke -> medication, drug
  - sydänvaiva -> "not in dict"
  - sydän -> heart
  - vaiva -> ailment, complaint, discomfort, inconvenience, trouble, vexation
- Many ways to combine components in the query
Compound Processing, 3
Sample English CLIR query:
  #sum( #syn( medication drug ) heart #syn( ailment complaint discomfort inconvenience trouble vexation ))
i.e. translating as if source compounds were phrases. Source compound handling may vary here:
  #sum( #syn( medication drug ) #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation )))
#uw3 = proximity operator allowing three intervening words, free word order
i.e. forming all proximity combinations as synonym sets.
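The two query variants above are easy to produce mechanically from the translation output. In this sketch #sum, #syn and #uw3 are the InQuery operators from the slides, while the helper function names and variable names are our own:

```python
# Build the InQuery-style queries shown above from the translation output.
def syn(words):
    return "#syn( " + " ".join(words) + " )"

def compound_as_proximity(head_transl, modifier_transl, window=3):
    """All modifier x head proximity pairs, wrapped in one synonym set."""
    pairs = [f"#uw{window}( {m} {h} )" for m in modifier_transl for h in head_transl]
    return "#syn( " + " ".join(pairs) + " )"

laake = ["medication", "drug"]
sydan = ["heart"]
vaiva = ["ailment", "complaint", "discomfort", "inconvenience", "trouble", "vexation"]

# variant 1: compound components translated as if a phrase
q1 = "#sum( " + " ".join([syn(laake), "heart", syn(vaiva)]) + " )"
# variant 2: components combined with proximity operators
q2 = "#sum( " + " ".join([syn(laake), compound_as_proximity(vaiva, sydan)]) + " )"
print(q2)
```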
Compound Processing, 4
- No clear benefits seen from using proximity combinations
- Nor did we observe a great effect from changing the proximity operator (OD vs. UW)
- Some monolingual results follow (Airio 2006)
Language  Index type      Average P %  Diff to baseline
English   Inflected       43.4         --
          Lemmas          45.6         +2.2
          Stemmed         46.3         +2.9
Finnish   Inflected       31.0         --
          Lemma & decomp  50.5         +19.5
          Lemmas          47.0         +16.0
          Stemmed         48.5         +17.5
Swedish   Inflected       30.2         --
          Lemma & decomp  38.8         +8.6
          Lemmas          31.4         +1.2
          Stemmed         33.5         +3.3
German    Inflected       30.2         --
          Lemma & decomp  36.2         +6.0
          Lemmas          31.9         +1.7
          Stemmed         35.7         +5.5

(InQuery/CLEF/Raw&TWOL&Porter)
[Figures: precision-recall curves for Finnish, English and Swedish, ordered by increasing morphological complexity]
Hedlund 2002
- Compound translation as compounds: 47 German CLEF 2001 topics, English document collection
- Comprehensive dictionary (many compounds) vs. small dictionary (no compounds): mean AP 34.7% vs. 30.4%; the dictionary matters ...
- Alternative approach: if not translatable, split and translate the components
CLEF Ger -> Eng
1. best manually translated         0.4465
2. large dict, no comp splitting    0.3520
3. limited dict, no comp splitting  0.3057
4. large dict & comp splitting      0.3830
5. limited dict & comp splitting    0.3547
(InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n)
CLIR Findings: Airio 2006
English -> target

Target language  Index type      Average P %  Diff to baseline %
Finnish          Lemma & decomp  35.5         --
                 Lemmas          29.0         -6.5
                 Stemmed         20.8         -14.7
Swedish          Lemma & decomp  27.1         --
                 Lemmas          17.4         -9.7
                 Stemmed         19.0         -8.1
German           Lemma & decomp  31.0         --
                 Lemmas          26.4         -4.6
                 Stemmed         25.7         -5.3

(InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter)
[Figures: precision-recall curves for Eng->Ger, Eng->Swe and Eng->Fin]
4. Generative Methods
[Figure: the variation-handling taxonomy again: reductive methods (stemming, lemmatization) vs. generative methods (inflectional stem generation; word form generation: all forms or FCG)]
Generative Methods: Inflectional Stems
- Instead of normalization, generate inflectional stems for an inflectional index; then use the stems to harvest full forms from the index; long queries ...
- ... OR ...
- Instead of normalization, generate full inflectional forms for an inflectional index. Long queries? Sure! Sounds absolutely crazy ...
- ... BUT! Are morphologically complex languages really that complex in IR in practice?
- Instead of full-form generation, generate only sufficient forms -> FCG (frequent case generation)
- In Finnish, 9-12 forms cover 85% of all occurrences of nouns
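The FCG idea can be sketched as follows. The suffix list covers a handful of frequent Finnish noun cases and works only for stems without consonant gradation, such as talo ('house'); real FCG also handles the inflectional stem changes, and the particular suffix selection here is illustrative, not the published 9-12 form set.

```python
# Sketch of frequent case generation (FCG): expand a query lemma into a few
# frequent surface forms instead of normalizing the index. Illustrative only;
# valid for gradation-free stems like talo 'house'.
FREQUENT_CASE_SUFFIXES = [
    "",     # nominative:      talo
    "n",    # genitive:        talon
    "a",    # partitive:       taloa
    "ssa",  # inessive:        talossa
    "sta",  # elative:         talosta
    "lla",  # adessive:        talolla
    "t",    # nominative pl.:  talot
]

def fcg_expand(stem):
    return [stem + suf for suf in FREQUENT_CASE_SUFFIXES]

print(fcg_expand("talo"))
# ['talo', 'talon', 'taloa', 'talossa', 'talosta', 'talolla', 'talot']
```

The expanded forms are then OR-ed into the query against the inflected index, which is why FCG queries are longer than lemmatized ones but far shorter than full-form generation.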
Kettunen & al 2006: Finnish IR

          MAP by relevance level
Method    Liberal  Normal  Stringent
TWOL      37.8     35.0    24.1
FCG12     32.7     30.0    21.4
FCG6      30.9     28.0    21.0
Snowball  29.8     27.7    20.0
Raw       19.6     18.9    12.4

(monolingual)
Kettunen & al 2007: Other Languages

          MAP by language
Method    Swe       Ger       Rus
TWOL      32.6      39.7      ..
FCG       30.6 /4   38.0 /4   32.7 /2
FCG       29.1 /2   36.8 /2   29.2 /6
Snowball  28.5      39.1      34.7
Raw       24.0      35.9      29.8

(results for long queries; monolingual)
CLIR Findings: Airio 2008

Language pair  Raw transl  Fi-FCG_9 / Sv-FCG_4  Fi-FCG_12 / Sv-FCG_7  Lemmatized
Fin -> Eng     11.2        32.4                 32.5                  39.6
Fin -> Swe     14.3        22.6                 23.9                  35.2
Eng -> Swe     18.1        25.1                 27.3                  34.1
Swe -> Fin     11.7        28.0                 27.9                  37.6
5. Query Structures
- Translation ambiguity such as ...
  - Homonymy: homophony, homography (examples: platform, bank, book)
  - Inflectional homography (examples: train, trains, training; book, books, booking)
  - Polysemy (examples: back, train)
- ... is a problem in CLIR.
Ambiguity Resolution Methods
- Part-of-speech tagging (e.g. Ballesteros & Croft '98)
- Corpus-based methods (Ballesteros & Croft '96, '97; Chen & al. '99)
- Query expansion
- Collocations
- Query structuring - the Pirkola method (1998)
Query Structuring
- From weak to strong query structures by recognition of: concepts, expression weights, phrases and compounds
- Queries may be combined ... query fusion
[Figure: decision tree over concepts / weighting / phrases, from the weakest structure #sum(a b c d e) to the strongest, e.g. #wsum(1 3 #syn(a #3(b c)) 1 #syn(d e))]
Structured Queries in CLIR
- CLIR performance (Pirkola 1998, 1999)
- English baselines: manual Finnish translations
- Automatic dictionary translation FIN -> ENG:
  - natural language queries (NL) vs. concept queries (BL)
  - structured vs. unstructured translations
  - single words (NL/S) vs. phrases marked (NL/WP)
  - general and/or special dictionary translation
- 500,000-document TREC subcollection; probabilistic retrieval (InQuery); 30 health-related requests
The Pirkola Method
- All translations of all senses provided by the dictionary are incorporated in the query
- All translations of each source language word are combined by the synonym operator; synonym groups by #and or #sum
- This effectively provides disambiguation
An Example
Consider the Finnish natural language query: lääke sydänvaiva [= medicine heart_problem]
Sample English CLIR query:
  #sum( #syn( medication drug ) heart #syn( ailment complaint discomfort inconvenience trouble vexation ))
Each source word forms a synonym set.
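The Pirkola structuring reduces to a few lines of code: every source word becomes one #syn group over all of its dictionary translations, and the groups are joined by #sum. The mini-dictionary repeats the example above; mapping the untranslatable compound sydänvaiva directly to its component translation "heart" is our simplification for the sketch.

```python
# The Pirkola method in miniature: one #syn group per source word.
DICT = {
    "lääke": ["medication", "drug"],
    # untranslatable compound, represented here by its component translation
    "sydänvaiva": ["heart"],
}

def pirkola_query(source_words, dictionary):
    groups = []
    for w in source_words:
        translations = dictionary.get(w, [w])   # untranslated keys pass through
        if len(translations) == 1:
            groups.append(translations[0])
        else:
            groups.append("#syn( " + " ".join(translations) + " )")
    return "#sum( " + " ".join(groups) + " )"

print(pirkola_query(["lääke", "sydänvaiva"], DICT))
# #sum( #syn( medication drug ) heart )
```

Because InQuery treats a #syn group as a single evidence source, a source word with many translations does not dominate the query, which is where the disambiguation effect comes from.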
Query Translation Test Query Translation Test Set-upSet-up
InQueryTRECUnix-server
Translated Finnish RequestEnglish Request
Finnish NL Query Finnish BL Query
General Dict Med. Dict.
Baseline Queries Translated English Queries
General Dict Med. Dict.
Unstructured NL/S Queries
  #sum(tw11, tw12, ..., tw21, tw22, ..., twn1, ..., twnk)
- Only 38% of the average baseline precision (sd & gd)
[Figure: precision-recall curves for GD, SD, GDSD and the baseline]
Structured Queries
  #and(#syn(tw11, tw12, ...), #syn(tw21, tw22, ...), #syn(twn1, ..., twnk))
- 77% of the average baseline precision (sd & gd)
- Structure doubles precision in all cases
Structured Queries w/ Special Dictionary
Query Structuring, More Results

CLEF                     Unstructured queries  Structured queries  Change %
                         (no s-grams)          (no s-grams)
Topic set 2000 (N = 33)
Finnish - Eng            0.1609                0.2150              33.6
German - Eng             0.2097                0.2639              25.8
Swedish - Eng            0.2015                0.2242              11.3
Topic set 2001 (N = 47)
Finnish - Eng            0.2407                0.3443              43.0
German - Eng             0.3242                0.3830              18.1
Swedish - Eng            0.3466                0.3465              0.0
Transit CLIR - Query Structures
Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)
Transitive CLIR Results, 2
[Figure: precision-recall curves over the 2001 topics for swe-fin-eng, swe-eng, fin-swe-eng, fin-eng and monolingual eng]
Transitive CLIR Effectiveness

Regular relevance  MAP (N=35)  % monolingual performance  % direct performance
Swe-Eng-Fin        21.8        59**                       68**
Eng-Swe-Fin        27.5        75                         88
Ger-Eng-Fin        24.3        66**                       83
Ger-Swe-Fin        29.3        79                         100
Mean               25.7        70                         85

Lehtokangas & al 2008
TransCLIR + pRF Effectiveness

Regular relevance  MAP (N=35)  % monolingual + pRF  % monolingual  % direct  % pRF-expanded direct
Swe-Eng-Fin        28.7        68**                 78             90        78*
Eng-Swe-Fin        32.3        76*                  88             103       87
Ger-Eng-Fin        33.6        79                   91             115       104
Ger-Swe-Fin        34.5        81                   94             118       107
Mean               32.3        76                   88             107       94
6. OOV Words
- Low coverage -- non-translated words:
  - domain-specific terms in general dictionaries (e.g. dystrophy); covered in domain-specific dictionaries
  - compound words
  - proper names: persons, geographical names, ...
- Often central for query effectiveness
- Are large dictionaries the solution? BUT: an excessive number of senses and words for each key increases ambiguity problems, and still many words remain OOVs
OOV Processing
- Proper names are often spelled differently between languages: transliteration variation (Brussels, Bryssel; Chernobyl, Tshernobyl; Chechnya, Tsetsenia)
- Non-translated keys are often used as such in CLIR queries -- a simplistic approach, not optimal
- In some languages, proper names may inflect: in Finnish, "in Cairo" = Kairossa
Approximate String Matching
- Means for correcting spelling errors and matching variant word forms (e.g. proper names between languages)
- Methods: edit distance; n-grams and skip-grams; soundex and phonix, based on phonetic similarity; transliteration
Sample Digram Matching
Sample words: Bryssel, Brussels, Bruxelles
  Bryssel   -> N1 = {br, ry, ys, ss, se, el}
  Brussels  -> N2 = {br, ru, us, ss, se, el, ls}
  Bruxelles -> N3 = {br, ru, ux, xe, el, ll, le, es}
sim(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2| = 4 / 9 = 0.444
sim(N1, N3) = 2 / 12 = 0.167
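The digram similarity above is Jaccard similarity over the digram sets, reproduced here directly (no padding, matching the sets in the example):

```python
# Digram (n=2) Jaccard similarity, as in the Bryssel/Brussels example.
def digrams(word):
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def sim(a, b):
    na, nb = digrams(a), digrams(b)
    return len(na & nb) / len(na | nb)

print(round(sim("Bryssel", "Brussels"), 3))   # 0.444
print(round(sim("Bryssel", "Bruxelles"), 3))  # 0.167
```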
Skip-grams
- Generalizing n-gram matching: the strings to be compared are split into substrings of length n, but skipping characters is allowed
- Substrings are produced using various skip lengths
- n-grams remain a special case: no skips
(Pirkola & Keskustalo & al 2002)
An Example: katalyyttinen - catalytic
skip-0:
  {_k, ka, at, ta, al, ly, yy, yt, tt, ti, in, ne, en, n_}
  {_c, ca, at, ta, al, ly, yt, ti, ic, c_}
skip-1:
  {_a, kt, aa, tl, ay, ly, yt, yt, ti, tn, ie, nn, e_}
  {_a, ct, aa, tl, ay, lt, yi, tc, i_}
skip-2:
  {_t, ka, al, ty, ay, lt, yt, yi, tn, te, in, n_}
  {_t, ca, al, ty, at, li, yc, t_}
Calculate similarity over the different skip-gram sets.
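Skip-gram generation with start/end padding can be written as below; skip=0 reproduces ordinary digrams, and the outputs match the sets listed above:

```python
# Skip-gram generation with '_' padding; a skip-n gram pairs each character
# with the character n positions further on.
def skip_grams(word, skip, pad="_"):
    w = pad + word.lower() + pad
    step = skip + 1
    return [w[i] + w[i + step] for i in range(len(w) - step)]

print(skip_grams("catalytic", 0))
print(skip_grams("catalytic", 1))  # ['_a', 'ct', 'aa', 'tl', 'ay', 'lt', 'yi', 'tc', 'i_']
```

Similarity is then computed as in the digram example, but over the union of the skip-gram sets for the chosen skip lengths.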
Skip-gram Effectiveness
- Several n-gram methods tested: n-grams & skip-grams, with & without padding
- Relative improvement of some s-grams vs. n-grams in X -> Finnish name matching: 18.2% (Eng medical), 49.7% (Eng geographic), 20.7% (Ger geographic), 17.1% (Swe geographic)
- Statistically significant, p = 0.01-0.001
CLIR Findings: Järvelin & al 2008
- Closely related languages: Norwegian and Swedish
- Translation by string matching alone: no dictionary at all, no other vocabulary source
- Encouraging results
CLIR Findings: Airio 2008

Language pair  Raw transl  Fi-sgram_9 / Sv-sgram_4  Fi-sgram_12 / Sv-sgram_7  Lemmatized
Fin -> Eng     11.2        29.2                     31.0                      39.6
Eng -> Swe     18.1        25.7                     25.3                      34.1
Swe -> Fin *)  11.7        26.1                     26.7                      37.6
Fin -> Swe     14.3        26.7                     22.6                      35.2

*) target index not normalized
Rule-based Transliteration
- Between languages, originally 'same' words have different spellings: proper names, technical terms
- Transliteration rules are based on regular variations in the spelling of equivalent word forms between languages:
  - construction - konstruktio: c -> k
  - somatology - somatologia: y -> ia
  - universiti - university: i -> y
- Transliteration rules are mined from a bilingual word list; their frequency and reliability are recorded
Sample Rules German-Sample Rules German-to-Englishto-English
SourceSource TargetTarget LocationLocationConfidence Confidence
string string stringstring of the rule of the rule factorfactor
ekt ekt ect ect middlemiddle 89.289.2 m m ma ma endend 21.121.1 akt akt act act middlemiddle 86.786.7 ko ko coco beginningbeginning 80.780.7
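Rule application can be sketched as below: each rule may or may not fire, so candidate target forms are generated combinatorially, which is exactly why TRT can emit very many candidates. The rules repeat the German-to-English sample above; confidence handling is omitted and the location check is simplified (here "middle" just means any non-initial occurrence, and only the first such occurrence is rewritten).

```python
# Sketch of TRT candidate generation from location-tagged rules.
# Simplified: no confidence factors, first non-initial match only for "middle".
RULES = [("ko", "co", "beginning"), ("akt", "act", "middle"), ("ekt", "ect", "middle")]

def apply_rule(word, src, tgt, loc):
    if loc == "beginning" and word.startswith(src):
        return [tgt + word[len(src):]]
    if loc == "middle" and src in word[1:]:
        i = word.index(src, 1)
        return [word[:i] + tgt + word[i + len(src):]]
    return []

def trt_candidates(word):
    """Grow the candidate set by optionally applying each rule in turn."""
    candidates = {word.lower()}
    for src, tgt, loc in RULES:
        for cand in list(candidates):
            candidates.update(apply_rule(cand, src, tgt, loc))
    return candidates

print(sorted(trt_candidates("kompakt")))  # includes 'compact'
```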
TRT Translation - 2 Steps
[Figure: (1) TRT rule production into a TRT rule base; (2) an OOV source word is expanded into translation candidates by the rules, the candidates are skip-gram matched against the target index, and the identified target word(s) are evaluated by precision and recall]
TRT - Rule Collections

Rule collection  # Rules  # Rules (CF > 4.0%, Fr. > 2)
Spanish-English  8800     1295
Spanish-German   5412     984
Spanish-French   9724     1430
German-English   8609     1219
French-English   9873     1170
TRT Effectiveness
Finnish-to-English translation (HCF)

Term type      Digrams  TRT + digrams  % chg
Bio terms      61.4     72.0           +17.3
Place names    30.0     35.9           +19.7
Economics      32.2     38.0           +18.0
Technology     31.6     53.7           +69.9
Miscellaneous  33.8     40.6           +20.1
A Problem with TRT
- TRT gives many target word forms for a source word (even tens of thousands) but does not indicate the correct one
- For example, in Spanish-English translation TRT gives the following forms for the Spanish word biosintesis: biosintesis, biosintessis, biosinthesis, biosinthessis, biosyntesis, biosyntessis, biosynthesis, biosynthessis
- To identify the correct equivalent, we use FITE: frequency-based identification of equivalents
Sample Frequency Pattern
The frequency pattern associated with the English target word candidates for the Spanish word biosintesis:

Target candidate  Doc freq
biosynthesis      2 230 000
biosintesis       909
biosyntesis       634
biosinthesis      255
biosynthessis     3
biosintessis      0
...
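The core FITE decision can be shown in miniature: among the TRT candidates, prefer the one whose document frequency clearly dominates the rest. The frequencies repeat the pattern above; the dominance threshold `min_ratio` is our invention for the sketch, and the real method also checks relative frequency and length difference before accepting a candidate.

```python
# Miniature FITE: pick the candidate with a dominant document frequency.
FREQS = {
    "biosynthesis": 2_230_000, "biosintesis": 909, "biosyntesis": 634,
    "biosinthesis": 255, "biosynthessis": 3, "biosintessis": 0,
}

def fite_pick(candidates, freqs, min_ratio=10):
    """Return the top candidate if it dominates the runner-up by min_ratio."""
    ranked = sorted(candidates, key=lambda c: freqs.get(c, 0), reverse=True)
    top = freqs.get(ranked[0], 0)
    second = freqs.get(ranked[1], 0) if len(ranked) > 1 else 0
    if second == 0 or top / second >= min_ratio:
        return ranked[0]
    return None  # no clearly dominant equivalent

print(fite_pick(list(FREQS), FREQS))  # biosynthesis
```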
FITE-TRT Translation
[Figure: (1) TRT rule production into a TRT rule base; (2) an OOV source word is expanded into TRT translation candidates, then FITE identification uses frequency statistics (frequency pattern, relative frequency and length difference checks) to return the identified target word, or the native source word; evaluated by precision and recall]
FITE-TRT Effectiveness
- Spanish-English biological and medical spelling variants (n = 89)
- Finnish-English biological and medical spelling variants (n = 89)
- Translation toward English by TRT; English equivalent identification by FITE
FITE-TRT Effectiveness
Translation recall and precision for bio-terms

Source language  Translation  Recall %  Prec %
Spanish          Web          91.0      98.8
                 Freq list    82.0      98.8
Finnish          Web          71.9      97.0
                 Freq list    67.4      97.3
FITE-TRT Effectiveness
- UTACLIR, TREC Genomics Track 2004 topics
- Spanish-English actual OOV words (n = 93+5)
- Finnish-English actual OOV words (n = 48+5)
- Translation toward English by TRT; English equivalent identification by FITE
FITE-TRT Effectiveness
Translation recall and precision for actual OOV words

Source language  Translation  Recall %  Precision %
Spanish          Web          89.2      97.6
                 Freq list    87.1      97.6
Finnish          Web          72.9      97.2
                 Freq list    79.2      95.0
Genomics CLIR Experiment
- German - English CLIR in genomics
- TREC Genomics Track data, 50 topics (20 training + 30 testing)
- MedLine, 4.6 M medical abstracts
- Baseline: raw German as such + dictionary translation
- Test queries: FITE-TRT
7. Conclusions
Monolingual retrieval:
- Morphological complexity: who owns the index?
- Reductive and generative approaches
- Skewed distributions: surprisingly little may be enough
- Compound handling perhaps not critical in monolingual retrieval
More Conclusions
Cross-language retrieval:
- Query structuring is simple and effective
- Closely related languages / dialects: simple language-independent techniques may be enough
- Observe what happens between compounding vs. isolating languages: compound splitting seems essential
- Inflected-form indices: skip-grams or FCG after translation may be a solution
- Specific domains: domain dictionaries or aligned corpora
Thanks!