Managing Morphologically Complex Languages in Information Retrieval
Kal Järvelin & Many Others, University of Tampere


Page 1: Managing Morphologically Complex Languages in Information Retrieval

Managing Morphologically Complex Languages in Information Retrieval

Kal Järvelin & Many Others
University of Tampere

Page 2: Managing Morphologically Complex Languages in Information Retrieval

1. Introduction

  • Morphologically complex languages
    • unlike English, Chinese
    • rich inflectional and derivational morphology
    • rich compound formation
  • U. Tampere experiences 1998 - 2008
    • monolingual IR
    • cross-language IR
    • focus: Finnish, Germanic languages, English

Page 3: Managing Morphologically Complex Languages in Information Retrieval

Methods for Morphology Variation Management

[Diagram: taxonomy of methods. Reductive Methods: Stemming, Lemmatization. Generative Methods: Infl stem Generation, Word Form Generation. Further node labels: Rule-based (twice), Rules + Dict (twice), Inflectional Stems, Inflectional Stems enhanced, Generating All Forms, FCG.]

Page 4: Managing Morphologically Complex Languages in Information Retrieval

Agenda

  1. Introduction
  2. Reductive Methods
  3. Compounds
  4. Generative Methods
  5. Query Structures
  6. OOV Words
  7. Conclusion

Page 5: Managing Morphologically Complex Languages in Information Retrieval

2. Normalization

  • Reductive methods, conflation
    • stemming
    • lemmatization
    + conflation -> simpler searching
    + smaller index
    + provides query expansion
  • Stemming available for many languages (e.g. Porter stemmer)
  • Lemmatizers less available and more demanding (dictionary requirement)
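
A minimal sketch of what conflation does to an index, under loud assumptions: the suffix list and the lemma dictionary below are hypothetical toys, not the Porter stemmer or any real lemmatizer. It only illustrates why conflated keys give a smaller index and an implicit query expansion, and why the lemmatizer depends on dictionary coverage.

```python
# Toy illustration only: suffix list and lemma dictionary are hypothetical.
SUFFIXES = ("ing", "es", "s")
LEMMA_DICT = {"mice": "mouse", "books": "book", "booking": "book"}


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Dictionary lookup; tokens missing from the dictionary pass through."""
    return LEMMA_DICT.get(token, token)


def build_index(tokens, normalize):
    """Map each normalized key to the set of surface forms it conflates."""
    index = {}
    for token in tokens:
        index.setdefault(normalize(token), set()).add(token)
    return index


tokens = ["book", "books", "booking", "mice"]
stemmed = build_index(tokens, stem)       # rule-based, no dictionary needed
lemmas = build_index(tokens, lemmatize)   # handles "mice" only via the dictionary
```

Searching either index for the key `book` retrieves all three surface forms, which is the query expansion effect named on the slide; the irregular form `mice` is conflated only where a dictionary entry exists.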

Page 6: Managing Morphologically Complex Languages in Information Retrieval

Alkula 2001

  • Boolean environment, inflected index, Finnish:
    • manual truncation vs. automatic stemming
    • stemming improves P and hurts R
    • many derivatives are lost
  • Boolean environment, infl vs. lemma index, Finnish:
    • manual truncation vs. lemmatization
    • lemmatization improves P and hurts R
    • many derivatives are lost, others correctly avoided
  • Differences not great between automatic methods

Page 7: Managing Morphologically Complex Languages in Information Retrieval

Kettunen & al 2005

  • Ranked retrieval, Finnish: three problems
    • how do lemmatization and inflectional stem generation compare in a best-match environment?
    • is a stemmer realistic for handling Finnish morphology?
    • feasibility of simulated truncation in a best-match system?
  • Lemmatized vs. inflected form vs. stemmed index.

Page 8: Managing Morphologically Complex Languages in Information Retrieval

Kettunen & al. 2005

  Method         Index      MAP    Change %
  FinTWOL        lemmas     35.0   --
  Inf Stem Gen   inflform   34.2   -2.3
  Porter         stemmed    27.7   -20.9
  Raw            inflform   18.9   -46.0

  • But very long queries for inflectional stem generation & expansion (thousands of words); weaker generations shorter but progressively deteriorating results.

(InQuery/TUTK/graded-35/regular)
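
The Change % column appears to be the relative change in MAP against the FinTWOL lemma baseline. A quick check under that assumption (the function name is ours, not from the study):

```python
def relative_change(value, baseline):
    """Relative change of value against baseline, in percent, one decimal."""
    return round((value - baseline) / baseline * 100, 1)


baseline_map = 35.0  # FinTWOL, lemma index
maps = {"Inf Stem Gen": 34.2, "Porter": 27.7, "Raw": 18.9}
changes = {method: relative_change(m, baseline_map) for method, m in maps.items()}
```

Note that the later monolingual tables from Airio 2006 report differences in percentage points instead, so the two kinds of figures are not directly comparable.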

Page 9: Managing Morphologically Complex Languages in Information Retrieval

Kettunen & al. 2005

[Figure not recoverable from the transcript]

Page 10: Managing Morphologically Complex Languages in Information Retrieval

MonoIR: Airio 2006

  Language   Index type   Average P %   Diff to baseline (% points)
  English    Inflected    43.4
             Lemmas       45.6          2.2
             Stemmed      46.3          2.9
  Finnish    Inflected    31.0
             Lemmas       47.0          16.0
             Stemmed      48.5          17.5
  Swedish    Inflected    30.2
             Lemmas       31.4          1.2
             Stemmed      33.5          3.3
  German     Inflected    30.2
             Lemmas       31.9          1.7
             Stemmed      35.7          5.5

InQuery/CLEF/TD/TWOL&Porter&Raw

Page 11: Managing Morphologically Complex Languages in Information Retrieval

CLIR: Inflectional Morphology

  • NL queries contain inflected form source keys
  • Dictionary headwords are in basic form (lemmas)
  • Problem significance varies by language
  • Stemming
    • stem both the dictionary and the query words
    • but may cause all too many translations
  • Stemming in dictionary translation best applied after translation.

Page 12: Managing Morphologically Complex Languages in Information Retrieval

Lemmatization in CLIR

  • Lemmatization
    • easy to access dictionaries
    • but tokens may be ambiguous
    • dictionary translations not always in basic form
    • lemmatizer's dictionary coverage
      • insufficient -> non-lemmatized source keys, OOVs
      • too broad coverage -> too many senses provided

Page 13: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Airio 2006

English -> X

  Target X   Index type   Average P %   Diff to Split %
  Finnish    Lemmas       29.0          -6.5
             Stemmed      20.8          -14.7
  Swedish    Lemmas       17.4          -9.7
             Stemmed      19.0          -8.1
  German     Lemmas       26.4          -4.6
             Stemmed      25.7          -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter


Page 15: Managing Morphologically Complex Languages in Information Retrieval

3. Compounds

  • Compounds, compound word types
    • determinative: Weinkeller, vinkällare, life-jacket
    • copulative: schwarzweiss, svartvit, black-and-white
    • compositional: Stadtverwaltung, stadsförvaltning
    • non-compositional: Erdbeere, jordgubbe, strawberry
  • Note on spelling: compound word components are spelled together (if not -> phrases)

Page 16: Managing Morphologically Complex Languages in Information Retrieval

Compound Word Translation

  • Not all compounds are in the dictionary
    • some languages are very productive
    • small dictionaries: atomic words, old non-compositional compounds
    • large dictionaries: many compositional compounds added
  • Compounds remove phrase identification problems, but cause translation and query formulation problems

Page 17: Managing Morphologically Complex Languages in Information Retrieval

Joining Morphemes

  • Joining morphemes complicate compound analysis & translation
  • Joining morpheme types in Swedish
    • <omission>   flicknamn
    • -s           rättsfall
    • -e           flickebarn
    • -a           gästabud
    • -u           gatubelysning
    • -o           människokärlek
  • Joining morpheme types in German
    • -s           Handelsvertrag
    • -n           Affenhaus
    • -e           Gästebett
    • -en          Fotographenausbildung
    • -er          Gespensterhaus
    • -es          Freundeskreis
    • -ens         Herzensbrecher
    • <omission>   Sprachwissenschaft
  • Suggestive finding that the treatment of joining morphemes improves MAP by 2 % (Hedlund 2002, SWE->ENG, 11 Qs)
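
To make the joining-morpheme problem concrete, here is a rough sketch of two-part compound splitting that tolerates the Swedish joining morphemes listed above. The component lexicon is a hypothetical toy built from the slide's examples; a real system would use a morphological analyzer and would also map component stems such as "gat" back to their lemmas.

```python
# Hypothetical toy lexicon of component forms taken from the slide's examples.
COMPONENTS = {"rätt", "fall", "flick", "barn", "namn", "gäst", "bud",
              "gat", "belysning", "människ", "kärlek"}
# Swedish joining morphemes from the slide; "" stands for <omission>.
JOINERS = ("", "s", "e", "a", "u", "o")


def split_compound(word):
    """Return (head, joiner, tail) for the first two-part split found, else None."""
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in COMPONENTS:
            continue
        for joiner in JOINERS:
            tail = rest[len(joiner):]
            if rest.startswith(joiner) and tail in COMPONENTS:
                return head, joiner, tail
    return None


splits = {w: split_compound(w) for w in ("rättsfall", "gatubelysning", "flicknamn")}
```

Without the joiner list, "rättsfall" would leave the unanalyzable remainder "sfall", which is the kind of failure that then blocks component translation.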

Page 18: Managing Morphologically Complex Languages in Information Retrieval

Compound Processing, 2

  • A Finnish natural language query:
    • lääkkeet sydänvaivoihin (medicines for heart problems)
  • The output of morphological analysis
    • lääke
    • sydänvaiva, sydän, vaiva
  • Dictionary translation and the output of component tagging:
    • lääke ---> medication drug
    • sydänvaiva - "not in dict"
    • sydän ---> heart
    • vaiva ---> ailment, complaint, discomfort, inconvenience, trouble, vexation
  • Many ways to combine components in query

Page 19: Managing Morphologically Complex Languages in Information Retrieval

Compound Processing, 3

  • Sample English CLIR query:
    #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ))
    • i.e. translating as if source compounds were phrases
  • Source compound handling may vary here:
    #sum( #syn( medication drug ) #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation )))
    • #uw3 = proximity operator for three intervening words, free word order
    • i.e. forming all proximity combinations as synonym sets.
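
The second query form can be generated mechanically from the component translations. The sketch below is only an illustration of that construction: the helper names are ours, and the exact spacing of the query string is an assumption rather than InQuery's required syntax.

```python
from itertools import product


def syn(terms):
    """Wrap terms in a synonym operator."""
    return "#syn( " + " ".join(terms) + " )"


def proximity_syn(component_translations, window=3):
    """One #uwN expression per combination of component translations,
    all gathered into a single synonym set."""
    combos = product(*component_translations)
    return syn(f"#uw{window}( {' '.join(combo)} )" for combo in combos)


medicine = ["medication", "drug"]                      # lääke
heart = ["heart"]                                      # sydän
ailment = ["ailment", "complaint", "discomfort",
           "inconvenience", "trouble", "vexation"]     # vaiva

query = "#sum( " + syn(medicine) + " " + proximity_syn([heart, ailment]) + " )"
```

The number of proximity expressions is the product of the translation counts per component (here 1 x 6), so productive compounds with ambiguous components can inflate the query quickly.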

Page 20: Managing Morphologically Complex Languages in Information Retrieval

Compound Processing, 4

  • No clear benefits seen from using proximity combinations.
  • Nor did we observe a great effect from changing the proximity operator (OD vs. UW)
  • Some monolingual results follow (Airio 2006)

Page 21: Managing Morphologically Complex Languages in Information Retrieval

  Language   Index type       Average P %   Diff to Baseline
  English    Inflected        43.4
             Lemmas           45.6          2.2
             Stemmed          46.3          2.9
  Finnish    Inflected        31.0
             Lemma & decomp   50.5          19.5
             Lemmas           47.0          16.0
             Stemmed          48.5          17.5
  Swedish    Inflected        30.2
             Lemma & decomp   38.8          8.6
             Lemmas           31.4          1.2
             Stemmed          33.5          3.3
  German     Inflected        30.2
             Lemma & decomp   36.2          6.0
             Lemmas           31.9          1.7
             Stemmed          35.7          5.5

InQuery/CLEF/Raw&TWOL&Porter

Page 22: Managing Morphologically Complex Languages in Information Retrieval

[Three figures not recoverable from the transcript; panel labels: Finnish, English, Swedish. Caption: Morphological complexity increases]

Page 23: Managing Morphologically Complex Languages in Information Retrieval

Hedlund 2002

  • Compound translation as compounds:
    • 47 German CLEF 2001 topics, English document collection.
    • comprehensive dictionary (many compounds) vs. small dict (no compounds)
    • mean AP 34.7% vs. 30.4%
    • dictionary matters ...
  • Alternative approach: if not translatable, split and translate components

Page 24: Managing Morphologically Complex Languages in Information Retrieval

CLEF Ger -> Eng

  1. best manually translated              0.4465
  2. large dict, no comp splitting         0.3520
  3. limited dict, no comp splitting       0.3057
  4. large dictionary & comp splitting     0.3830
  5. limited dict & comp splitting         0.3547

InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n

Page 25: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Airio 2006

English -> target language

  Target language   Index type       Average P %   Diff to Baseline %
  Finnish           Lemma & decomp   35.5
                    Lemmas           29.0          -6.5
                    Stemmed          20.8          -14.7
  Swedish           Lemma & decomp   27.1
                    Lemmas           17.4          -9.7
                    Stemmed          19.0          -8.1
  German            Lemma & decomp   31.0
                    Lemmas           26.4          -4.6
                    Stemmed          25.7          -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter

Page 26: Managing Morphologically Complex Languages in Information Retrieval

[Figures not recoverable from the transcript; panel labels: Eng->Ger, Eng->Swe, Eng->Fin]


Page 28: Managing Morphologically Complex Languages in Information Retrieval

4. Generative Methods

[Diagram: Variation handling taxonomy. Reductive Methods: Stemming, Lemmatization. Generative Methods: Infl stem Generation, Word Form Generation. Further node labels: Rule-based (twice), Rules + Dict (twice), Inflectional Stems, Inflectional Stems enhanced, Generating All Forms, FCG.]

Page 29: Managing Morphologically Complex Languages in Information Retrieval

Generative Methods: inf stems

  • Instead of normalization, generate inflectional stems for an inflectional index.
    • then using stems harvest full forms from the index
    • long queries ...

Page 30: Managing Morphologically Complex Languages in Information Retrieval

... OR ...

  • Instead of normalization, generate full inflectional forms for an inflectional index.
    • Long queries? Sure!
    • Sounds absolutely crazy ...

Page 31: Managing Morphologically Complex Languages in Information Retrieval

... BUT!

  • Are morphologically complex languages that complex in IR in practice?
  • Instead of full form generation, only generate sufficient forms -> FCG
  • In Finnish, 9-12 forms cover 85% of all occurrences of nouns
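
The idea behind generating only "sufficient" forms is a coverage calculation over the frequency distribution of inflected forms. The sketch below shows that calculation on made-up counts; the numbers and form labels are hypothetical and are not the Finnish corpus statistics behind the 85% figure.

```python
def forms_needed(form_counts, target=0.85):
    """Number of most frequent forms needed to cover `target` of all occurrences."""
    total = sum(form_counts.values())
    covered = 0
    ranked = sorted(form_counts.values(), reverse=True)
    for n, count in enumerate(ranked, start=1):
        covered += count
        if covered / total >= target:
            return n
    return len(form_counts)


# Hypothetical occurrence counts per inflected form of one noun class.
counts = {"form_a": 50, "form_b": 30, "form_c": 15, "form_d": 5}
needed = forms_needed(counts, target=0.85)
```

With a skewed distribution like this, a small number of forms reaches the target, which is why a query expanded with only the most frequent forms can stay short.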

Page 32: Managing Morphologically Complex Languages in Information Retrieval

Kettunen & al 2006: Finnish IR

MAP for relevance level

  Method     Liberal   Normal   Stringent
  TWOL       37.8      35.0     24.1
  FCG12      32.7      30.0     21.4
  FCG6       30.9      28.0     21.0
  Snowball   29.8      27.7     20.0
  Raw        19.6      18.9     12.4

... monolingual ...

Page 33: Managing Morphologically Complex Languages in Information Retrieval

Kettunen & al 2007: Other Langs IR

MAP for Language

  Method     Swe       Ger       Rus
  TWOL       32.6      39.7      ..
  FCG        30.6 /4   38.0 /4   32.7 /2
  FCG        29.1 /2   36.8 /2   29.2 /6
  Snowball   28.5      39.1      34.7
  Raw        24.0      35.9      29.8

Results for long queries ... monolingual ...

Page 34: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Airio 2008

  Language pair   Raw transl   Fi-FCG_9 / Sv-FCG_4   Fi-FCG_12 / Sv-FCG_7   Lemmatized
  Fin -> Eng      11.2         32.4                  32.5                   39.6
  Fin -> Swe      14.3         22.6                  23.9                   35.2
  Eng -> Swe      18.1         25.1                  27.3                   34.1
  Swe -> Fin      11.7         28.0                  27.9                   37.6


Page 36: Managing Morphologically Complex Languages in Information Retrieval

5. Query Structures

  • Translation ambiguity such as ...
    • Homonymy: homophony, homography
      • Examples: platform, bank, book
    • Inflectional homography
      • Examples: train, trains, training
      • Examples: book, books, booking
    • Polysemy
      • Examples: back, train
  • ... a problem in CLIR.

Page 37: Managing Morphologically Complex Languages in Information Retrieval

Ambiguity Resolution Methods

  • Part-of-speech tagging (e.g. Ballesteros & Croft '98)
  • Corpus-based methods (Ballesteros & Croft '96; '97; Chen & al. '99)
    • Query Expansion
    • Collocations
  • Query structuring - the Pirkola Method (1998)

Page 38: Managing Morphologically Complex Languages in Information Retrieval

Query Structuring

  • From weak to strong query structures by recognition of ...
    • concepts
    • expression weights
    • phrases, compounds
  • Queries may be combined ... query fusion

[Diagram: decision tree with the questions Concepts? (no / yes), Weighting? (no / yes) and Phrases? (no / yes), ranging from the weakest structure #sum(a b c d e) to the strongest #wsum(1 3 #syn(a #3(b c)) 1 #syn(d e))]

Page 39: Managing Morphologically Complex Languages in Information Retrieval

Structured Queries in CLIR

  • CLIR performance (Pirkola 1998, 1999)
    • English baselines, manual Finnish translations
    • Automatic dictionary translation FIN -> ENG
      • natural language queries (NL) vs. concept queries (BL)
      • structured vs. unstructured translations
      • single words (NL/S) vs. phrases marked (NL/WP)
      • general and/or special dictionary translation
    • 500,000 document TREC subcollection
    • probabilistic retrieval (InQuery)
    • 30 health-related requests

Page 40: Managing Morphologically Complex Languages in Information Retrieval

The Pirkola Method

  • All translations of all senses provided by the dictionary are incorporated in the query
  • All translations of each source language word are combined by the synonym operator, synonym groups by #and or #sum
    • this effectively provides disambiguation

Page 41: Managing Morphologically Complex Languages in Information Retrieval

An Example

  • Consider the Finnish natural language query:
    • lääke sydänvaiva [= medicine heart_problem]
  • Sample English CLIR query:
    #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ) )
  • Each source word forming a synonym set
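
The contrast between an unstructured and a structured translation can be sketched in a few lines. This is an illustrative reconstruction, not the original UTAClir code: the function names are ours, and leaving a single translation outside #syn simply mirrors the sample query above.

```python
def unstructured_query(translations):
    """Flat bag of all translation alternatives under one #sum."""
    words = [w for alternatives in translations.values() for w in alternatives]
    return "#sum( " + " ".join(words) + " )"


def structured_query(translations, operator="#sum"):
    """One synonym set per source word, combined by `operator` (#sum or #and)."""
    groups = []
    for alternatives in translations.values():
        if len(alternatives) == 1:
            groups.append(alternatives[0])
        else:
            groups.append("#syn( " + " ".join(alternatives) + " )")
    return operator + "( " + " ".join(groups) + " )"


# Dictionary output for the example query (source word -> translations).
translations = {
    "lääke": ["medication", "drug"],
    "sydän": ["heart"],
    "vaiva": ["ailment", "complaint", "discomfort",
              "inconvenience", "trouble", "vexation"],
}

flat = unstructured_query(translations)
structured = structured_query(translations)
```

In the flat query the six translations of "vaiva" outnumber the single translation of "sydän", so the ambiguous word dominates the evidence; grouping by source word keeps each concept to one synonym set.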

Page 42: Managing Morphologically Complex Languages in Information Retrieval

Query Translation Test Set-up

[Diagram; node labels: English Request, Translated Finnish Request, Finnish NL Query, Finnish BL Query, General Dict, Med. Dict., Translated English Queries, Baseline Queries, InQuery / TREC / Unix-server]

Page 43: Managing Morphologically Complex Languages in Information Retrieval

Unstructured NL/S Queries

  • #sum(tw11, tw12, ... , tw21, tw22, ... twn1, ... , twnk)
  • Only 38% of the average baseline precision (sd & gd)

[Figure not recoverable from the transcript; curve label: Baseline]

Page 44: Managing Morphologically Complex Languages in Information Retrieval

Structured Queries w/ Special Dictionary

  • #and(#syn(tw11, tw12, ... ), #syn(tw21, tw22, ...), #syn( twn1, ..., twnk))
  • 77% of the average baseline precision (sd & gd)
  • Structure doubles precision in all cases

[Precision-recall figure not recoverable from the transcript; axes: Precision, Recall; legend text: GD SD --> GDSD and GD BL, Baseline]

Page 45: Managing Morphologically Complex Languages in Information Retrieval

Query Structuring, More Results

  CLEF                      Unstructured queries,   Structured queries,   Change %
                            no s-grams              no s-grams
  Topic set 2000 (N = 33)
    Finnish - Eng           0.1609                  0.2150                33.6
    German - Eng            0.2097                  0.2639                25.8
    Swedish - Eng           0.2015                  0.2242                11.3
  Topic set 2001 (N = 47)
    Finnish - Eng           0.2407                  0.3443                43.0
    German - Eng            0.3242                  0.3830                18.1
    Swedish - Eng           0.3466                  0.3465                0.0

Page 46: Managing Morphologically Complex Languages in Information Retrieval

Transitive CLIR – Query Structures

Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)

Page 47: Managing Morphologically Complex Languages in Information Retrieval

Transitive CLIR Results, 2

[Precision-recall figure, 2001 Topics, not recoverable from the transcript; axes: Recall, Precision; curves: swe-fi-eng, swe-eng, fin-swe-en, fin-eng, monol eng]

Page 48: Managing Morphologically Complex Languages in Information Retrieval

Transitive CLIR Effectiveness

Regular relevance, N = 35

  Language path   MAP    % monolingual performance   % direct performance
  Swe-Eng-Fin     21.8   59**                        68**
  Eng-Swe-Fin     27.5   75                          88
  Ger-Eng-Fin     24.3   66**                        83
  Ger-Swe-Fin     29.3   79                          100
  Mean            25.7   70                          85

Lehtokangas & al 2008

Page 49: Managing Morphologically Complex Languages in Information Retrieval

TransCLIR + pRF effectiveness

Regular relevance, N = 35

  Language path   MAP    % monolingual + pRF   % monolingual   % direct   % pRF exp direct
  Swe-Eng-Fin     28.7   68**                  78              90         78*
  Eng-Swe-Fin     32.3   76*                   88              103        87
  Ger-Eng-Fin     33.6   79                    91              115        104
  Ger-Swe-Fin     34.5   81                    94              118        107
  Mean            32.3   76                    88              107        94


Page 51: Managing Morphologically Complex Languages in Information Retrieval

6. OOV Words

  • Low coverage -- non-translated words
    • Domain-specific terms in general dictionaries
      • e.g. dystrophy
      • Covered in domain-specific dictionaries
    • Compound words
    • Proper names: persons, geographical names, …
    • Often central for query effectiveness
  • Large dictionaries solution? BUT:
    • Excessive number of senses and words for each
    • Increases ambiguity problems
    • and still many words remain OOVs

Page 52: Managing Morphologically Complex Languages in Information Retrieval

OOV Processing

  • Proper names are often spelled differently between languages
    • Transliteration variation
    • Brussels, Bryssel; Chernobyl, Tshernobyl; Chechnya, Tsetsenia
  • Non-translated keys are often used as such in CLIR queries -- simplistic approach, not optimal
  • In some languages, proper names may inflect
    • In Finnish "in Cairo" = Kairossa

Page 53: Managing Morphologically Complex Languages in Information Retrieval

Approximate String Matching

  • Means for
    • correcting spelling errors
    • matching variant word forms (e.g. proper names between languages)
  • Methods
    • edit-distance, n-grams, skip-grams
    • soundex, phonix, based on phone similarity
    • transliteration

Page 54: Managing Morphologically Complex Languages in Information Retrieval

Sample digram Matching

  • Sample words Bryssel, Brussels, Bruxelles
    • Bryssel --> N1 = {br, ry, ys, ss, se, el}
    • Brussels --> N2 = {br, ru, us, ss, se, el, ls}
    • Bruxelles --> N3 = {br, ru, ux, xe, el, ll, le, es}
  • sim(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2| = 4 / 9 = 0.444
  • sim(N1, N3) = 2 / 12 = 1/6 = 0.167
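
The digram similarity on this slide is the size of the shared digram set divided by the size of the combined set. A minimal sketch (lowercasing and the function names are our choices):

```python
def digrams(word):
    """Set of adjacent character pairs, case-folded, no padding."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def sim(a, b):
    """Shared digrams divided by all distinct digrams of both words."""
    return len(a & b) / len(a | b)


n1, n2, n3 = digrams("Bryssel"), digrams("Brussels"), digrams("Bruxelles")
sim_12 = round(sim(n1, n2), 3)
sim_13 = round(sim(n1, n3), 3)
```

Bryssel and Brussels share br, ss, se and el out of nine distinct digrams; Bryssel and Bruxelles share only br and el out of twelve.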

Page 55: Managing Morphologically Complex Languages in Information Retrieval

Skip-grams

  • Generalizing n-gram matching
    • The strings to be compared are split into substrings of length n
    • Skipping characters is allowed
    • Substrings produced using various skip lengths
    • n-grams remain a special case: no skips
    • (Pirkola & Keskustalo & al 2002)

Page 56: Managing Morphologically Complex Languages in Information Retrieval

An Example

  • katalyyttinen – catalytic
    • skip-0:
      {_k, ka, at, ta, al, ly, yy, yt, tt, ti, in, ne, en, n_}
      {_c, ca, at, ta, al, ly, yt, ti, ic, c_}
    • skip-1:
      {_a, kt, aa, tl, ay, ly, yt, yt, ti, tn, ie, nn, e_}
      {_a, ct, aa, tl, ay, lt, yi, tc, i_}
    • skip-2:
      {_t, ka, al, ty, ay, lt, yt, yi, tn, te, in, n_}
      {_t, ca, al, ty, at, li, yc, t_}
  • calculate similarity over different skip-gram sets
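
The gram lists in this example can be generated with a short function. This is a sketch assuming one padding character on each side, as the underscores in the example suggest; how the per-skip similarities are then combined is not stated on the slide and is left out here.

```python
def skip_grams(word, skip, pad="_"):
    """Character pairs with `skip` characters left out between the two members.
    skip=0 gives ordinary digrams of the padded word."""
    padded = pad + word + pad
    step = skip + 1
    return [padded[i] + padded[i + step] for i in range(len(padded) - step)]


# Gram lists per skip length for the Finnish and English word.
finnish = {k: skip_grams("katalyyttinen", k) for k in (0, 1, 2)}
english = {k: skip_grams("catalytic", k) for k in (0, 1, 2)}
```

The function returns lists in word order, so repeated grams are visible; converting each list to a set gives the input for a set-based similarity like the digram measure shown earlier.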

Page 57: Managing Morphologically Complex Languages in Information Retrieval

Skip-gram effectiveness

  • Several n-gram methods tested
    • n-grams & skip-grams
    • with & without padding
  • The relative improvement for some s-grams vs. n-grams in X -> Finnish name matching:
    • 18.2% (Eng medical)
    • 49.7% (Eng geograph)
    • 20.7% (Ger geograph)
    • 17.1% (Swe geograph)
    • Statistically significant, p = 0.01-0.001

Page 58: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Järvelin & al 2008

  • Closely related languages
    • Norwegian and Swedish
  • Translation by string matching alone
    • no dictionary at all, no other vocabulary source
    • encouraging results

Page 59: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Järvelin & al 2008

[Figure not recoverable from the transcript]

Page 60: Managing Morphologically Complex Languages in Information Retrieval

CLIR Findings: Airio 2008

  Language pair   Raw transl   Fi-sgram_9 / Sv-sgram_4   Fi-sgram_12 / Sv-sgram_7   Lemmatized
  Fin -> Eng      11.2         29.2                      31.0                       39.6
  Eng -> Swe      18.1         25.7                      25.3                       34.1
  Swe -> Fin      11.7         26.1                      26.7                       37.6
  Fin -> Swe      14.3         26.7                      22.6                       35.2

*) target index not normalized (the marker appears three times on the original slide; the columns it applies to are not recoverable from the transcript)


Rule-based Transliteration
- Between languages, the originally ‘same’ words have different spellings:
  - proper names, technical terms
- Transliteration rules are based on regular variations in the spelling of equivalent word forms between languages
  - construction - konstruktio, c -> k
  - somatology - somatologia, y -> ia
  - universiti - university, i -> y
- Transliteration rules are mined from a bilingual word list
  - their frequency and reliability are recorded
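One way to picture rule mining from a bilingual word list is sketched below. It is an illustration under assumptions, not the TRT rule production procedure itself: the helper names `mine_rules` and `confidence` are hypothetical, character alignment is delegated to Python's `difflib.SequenceMatcher`, no surrounding context or position is recorded, and "reliability" is approximated as the share of source words containing the rule's source string in which the rule actually applied.

```python
from collections import Counter
from difflib import SequenceMatcher


def mine_rules(pairs):
    """Count (source string, target string) substitutions over a word list."""
    rule_freq = Counter()
    for src, tgt in pairs:
        matcher = SequenceMatcher(None, src, tgt)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":  # a spelling difference between the two words
                rule_freq[(src[i1:i2], tgt[j1:j2])] += 1
    return rule_freq


def confidence(rule, rule_freq, pairs):
    """Assumed reliability: rule frequency / source words containing its source string, in %."""
    src_str, _ = rule
    occurrences = sum(1 for src, _ in pairs if src_str in src)
    return 100.0 * rule_freq[rule] / occurrences if occurrences else 0.0


if __name__ == "__main__":
    # Word pairs taken from the slide examples.
    word_list = [("somatology", "somatologia"), ("universiti", "university")]
    freq = mine_rules(word_list)
    for rule, count in freq.items():
        print(rule, count, confidence(rule, freq, word_list))
```

A real rule base would also need the rule's location in the word and thresholds on frequency and confidence before a rule is kept.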


Sample Rules German-to-English

Source string   Target string   Location of the rule   Confidence factor
ekt             ect             middle                 89.2
m               ma              end                    21.1
akt             act             middle                 86.7
ko              co              beginning              80.7
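To show how such rules turn one source word into several target candidates, here is a minimal sketch. The rule tuples are copied from the table above; everything else is an assumption for illustration: the function names, the way "beginning", "middle" and "end" are tested, the optional `min_cf` threshold, and the German test words Kollektor and Reaktion. Each applicable rule is either applied or not, so the number of candidates grows exponentially with the number of matching sites.

```python
from itertools import product

# (source string, target string, location, confidence factor) from the slide
RULES = [
    ("ekt", "ect", "middle", 89.2),
    ("m", "ma", "end", 21.1),
    ("akt", "act", "middle", 86.7),
    ("ko", "co", "beginning", 80.7),
]


def applicable_sites(word, rules, min_cf=0.0):
    """Find (start, end, replacement) triples where a rule matches at its location."""
    sites = []
    for src, tgt, location, cf in rules:
        if cf < min_cf:
            continue
        start = word.find(src)
        while start != -1:
            end = start + len(src)
            if start == 0:
                where = "beginning"
            elif end == len(word):
                where = "end"
            else:
                where = "middle"
            if where == location:
                sites.append((start, end, tgt))
            start = word.find(src, start + 1)
    return sites


def candidates(word, rules, min_cf=0.0):
    """Generate all forms obtained by applying any non-overlapping subset of rules."""
    sites = applicable_sites(word, rules, min_cf)
    forms = set()
    for mask in product([False, True], repeat=len(sites)):
        chosen = sorted(site for site, use in zip(sites, mask) if use)
        if any(a[1] > b[0] for a, b in zip(chosen, chosen[1:])):
            continue  # skip combinations whose rewrite spans overlap
        parts, pos = [], 0
        for start, end, tgt in chosen:
            parts.append(word[pos:start])
            parts.append(tgt)
            pos = end
        parts.append(word[pos:])
        forms.add("".join(parts))
    return forms


if __name__ == "__main__":
    print(sorted(candidates("kollektor", RULES)))
```

The unchanged source form is always among the candidates, which matters when the target language spells the word identically.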


TRT Translation – 2 Steps
[Diagram: step 1, TRT Translation: OOV source word -> translation candidates, using the TRT rule base (from TRT rule production). Step 2, skip-gram matching: translation candidates matched against the target index -> identified target word(s). Evaluation: precision, recall.]


TRT – rule collections

Rule collection   # Rules   # Rules (CF > 4.0%, frequency > 2)
Spanish-English   8800      1295
Spanish-German    5412      984
Spanish-French    9724      1430
German-English    8609      1219
French-English    9873      1170


TRT Effectiveness
Finnish-to-English translation (HCF)

Term type       Digrams   TRT + digrams   % chg
Bio terms       61.4      72.0            +17.3
Place names     30.0      35.9            +19.7
Economics       32.2      38.0            +18.0
Technology      31.6      53.7            +69.9
Miscellaneous   33.8      40.6            +20.1
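The "% chg" column is the relative change of the TRT + digrams score over the digram baseline, (improved - baseline) / baseline. The short check below recomputes it from the two score columns of the table; the function name `relative_change` is an illustrative choice.

```python
def relative_change(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (digrams, TRT + digrams) per term type, taken from the table above
SCORES = {
    "Bio terms": (61.4, 72.0),
    "Place names": (30.0, 35.9),
    "Economics": (32.2, 38.0),
    "Technology": (31.6, 53.7),
    "Miscellaneous": (33.8, 40.6),
}

if __name__ == "__main__":
    for term_type, (digrams, trt_digrams) in SCORES.items():
        print(f"{term_type}: {relative_change(digrams, trt_digrams):+.1f}")
```

All five recomputed values agree with the slide to one decimal place.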


A Problem with TRT
- TRT gives many target word forms for a source word (even tens of thousands) but does not indicate the correct one
- For example, in Spanish-English translation TRT gives the following forms for the Spanish word biosintesis:
  - biosintesis, biosintessis, biosinthesis, biosinthessis, biosyntesis, biosyntessis, biosynthesis, biosynthessis
- To identify the correct equivalent, we use FITE
  - frequency-based identification of equivalents


Sample Frequency Pattern
The frequency pattern associated with the English target word candidates for the Spanish word biosintesis:

Target candidate   Doc freq
biosynthesis       2 230 000
biosintesis        909
biosyntesis        634
biosinthesis       255
biosynthessis      3
biosintessis       0
...                ...
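The idea of picking the equivalent from such a frequency pattern can be sketched as follows. The document frequencies are the ones listed above; the selection logic is an assumption for illustration only. The function name `identify_equivalent` and the `min_ratio` threshold are hypothetical, and the single dominance test used here stands in for whatever frequency pattern, relative frequency and length difference checks FITE actually applies.

```python
# Document frequencies of the TRT candidates for Spanish "biosintesis" (from the slide)
DOC_FREQ = {
    "biosynthesis": 2_230_000,
    "biosintesis": 909,
    "biosyntesis": 634,
    "biosinthesis": 255,
    "biosynthessis": 3,
    "biosintessis": 0,
}


def identify_equivalent(doc_freq, min_ratio=10.0):
    """Return the most frequent candidate if it clearly dominates, else None.

    min_ratio is a hypothetical threshold: the top candidate must be at least
    this many times more frequent than the runner-up.
    """
    ranked = sorted(doc_freq.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no candidate occurs at all
    best, best_df = ranked[0]
    if len(ranked) > 1 and ranked[1][1] > 0 and best_df / ranked[1][1] < min_ratio:
        return None  # frequency pattern too flat to trust
    return best


if __name__ == "__main__":
    print(identify_equivalent(DOC_FREQ))
```

Returning None rather than a weak guess favors precision over recall.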


FITE-TRT Translation
[Diagram: step 1, TRT Translation: OOV source word -> translation candidates, using the TRT rule base (from TRT rule production). Step 2, FITE Identification: uses frequency statistics with the checks "frequency pattern OK", "relative frequency OK" and "length difference OK". Outputs shown: identified target word, native source word. Evaluation: precision, recall.]


FITE-TRT effectiveness
- Spanish-English biological and medical spelling variants (n = 89)
- Finnish-English biological and medical spelling variants (n = 89)
- Translation toward English by TRT
- English equivalent identification by FITE


FITE-TRT Effectiveness
Translation recall and precision for bio-terms

Source language   Frequency data   Translation recall %   Translation precision %
Spanish           Web              91.0                   98.8
                  Freq list        82.0                   98.8
Finnish           Web              71.9                   97.0
                  Freq list        67.4                   97.3


FITE-TRT effectiveness
- UTACLIR TREC Genomics Track 2004 topics
- Spanish-English actual OOV words (n = 93+5)
- Finnish-English actual OOV words (n = 48+5)
- Translation toward English by TRT
- English equivalents identification by FITE


FITE-TRT Effectiveness
Translation recall and precision for actual OOV words

Source language   Frequency data   Translation recall %   Translation precision %
Spanish           Web              89.2                   97.6
                  Freq list        87.1                   97.6
Finnish           Web              72.9                   97.2
                  Freq list        79.2                   95.0


Genomics CLIR experiment
- German - English CLIR in genomics
- TREC Genomics Track data, 50 topics (20 training + 30 testing)
- MedLine, 4.6 M medical abstracts
- Baseline: raw German as such + dictionary translation
- Test queries - FITE-TRT


Genomics CLIR experiment
[Figure: image not recoverable from the slide]


Agenda
1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion


8. Conclusions
- Monolingual retrieval
  - morphological complexity
  - who owns the index?
  - reductive and generative approaches
  - skewed distributions; surprisingly little may be enough
  - compound handling perhaps not critical in monolingual retrieval


More Conclusions
- Cross-language retrieval
  - Query structuring simple and effective
  - Closely related languages / dialects
    - simple language-independent techniques may be enough
  - Observe what happens between compounding vs isolating languages
    - compound splitting seems essential
  - Inflected form indices
    - skip-grams or FCG after translation may be a solution
  - Specific domains: domain dictionaries or aligned corpora


Thanks!