
The use of ontologies in Natural Language Engineering

Pattern Recognition & Artificial Intelligence group

Dept. of Computation and Information System

Polytechnic University of Valencia, Spain

Paolo Rosso

[email protected]

Natural Language Engineering (NLE)

Ph.D. students:
Davide Buscaldi
David Pinto
Rafael Guzmán
Yassine Benajiba
(Natalia Ponomareva)
+ José Manuel Gómez

On-going collaborations with:
INAOE (M. Montes), NPI (M. Alexandrov), BUAP (H. Jiménez): Mexico
University of San Luis (M. Errecalde): Argentina
University of Genova: Italy

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

1. The WordNet ontology

Basic lexical relations:

1. Synonymy

2. Polysemy: related meanings; e.g. bank, blood bank

3. Homonymy: unrelated meanings; e.g. bank (river) and bank (financial institution)

4. Hyponymy: sub-class (is-a relation), e.g. car is-a vehicle

5. Hypernymy: vice versa

6. Antonymy

7. Meronymy: part of (e.g. parts of a car)

8. Holonymy: vice versa (e.g. what a car is_composed_of)

1. The WordNet ontology

Dictionaries and thesauri: are they enough?

1. Roget thesaurus
2. Longman Dictionary of Contemporary English (LDOCE)
3. Cambridge Advanced Learner’s Dictionary (CALD): http://dictionary.cambridge.org

1. The WordNet ontology

We need an ontology:
• word definitions
• examples of usage
• + lexical relations between words!

The WordNet (WN) ontology:
• external lexical resource developed at Princeton University (G.A. Miller)
• based on synsets (sets of synonyms defining a lexical concept)
• not a poor ontology with only the is-a relation (tree): synsets are connected by various semantic relations (graph)
• a polysemic lexeme belongs to more than one synset
• no difference between polysemy and homonymy

1. The WordNet ontology

• lexical categories:
  – nouns (hierarchy)
  – verbs (3-level hierarchy)
  – adjectives and adverbs
• language: English

               # words    # synsets
  nouns        114,648     79,689
  verbs         11,306     13,508
  adjectives    21,436     18,563
  adverbs        4,669      3,664
  total        152,059    115,424

• fine granularity (version 2.0): too much?
• mapping synsets onto the WordNet Domains (IRST-Trento, B. Magnini)

1. The WordNet ontology

Mapping of the WN synsets onto WordNet Domains (200 categories): the Factotum category problem…

1. The WordNet ontology

lexical relations between synsets (synonymy):
• hypernymy, hyponymy, meronymy, holonymy
• entailment (verbs): e.g. snore -> sleep
• cause (verbs): synset A causes synset B, e.g. give – have
• pertainym (adj–noun): e.g. electrical – electricity
• attribute (adj–noun): e.g. small – size
• antonym (adj): e.g. small – big
• similar sense (adj): e.g. far – distant
• category: e.g. sample – statistics
• region: e.g. French revolution – France
• use: e.g. gone – euphemism

1. The WordNet ontology

Access to WordNet (a database of lexical relations):

a. library functions

b. on-line (browser): http://www.cogsci.princeton.edu/~wn/

1. The WordNet ontology

a. library functions

Part Of Speech (POS): noun, verb, adj, adv

index file: index.pos

e.g. plant n 4 5 @ ~ #m %s %p 03138429 00008864 04539420 07480098

  n: noun
  4 senses
  5 lexical relations: hypernymy (@), hyponymy (~), member_of (#m), substance_of (%s), part_of (%p)
  synset addresses

1. The WordNet ontology

a. library functions (cont.)

data file: data.pos

e.g. 00008864 n 03 plant 0 flora 0 plant_life 0 027 @ 00002086 n 0000 … 09463675 n 0000 | a living organism lacking the power of locomotion

  n: noun
  03: number of lexemes (0: separator)
  027: number of lexical relations
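A minimal Python sketch parsing the index.pos example line above (assuming the simplified field layout shown on the slide, not the full WordNet index format):

```python
# Parse the simplified index.pos line: lemma, POS, sense count,
# pointer count, pointer symbols, then one synset address per sense.
line = "plant n 4 5 @ ~ #m %s %p 03138429 00008864 04539420 07480098"

fields = line.split()
lemma, pos = fields[0], fields[1]
n_senses, n_pointers = int(fields[2]), int(fields[3])
pointer_symbols = fields[4:4 + n_pointers]   # ['@', '~', '#m', '%s', '%p']
synset_offsets = fields[4 + n_pointers:]     # one address per sense

assert len(synset_offsets) == n_senses
print(lemma, pos, pointer_symbols, synset_offsets)
```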

1. The WordNet ontology

b. on-line browser, e.g. bass (noun): 8 senses

1. bass -- (the lowest part of the musical range)
2. bass, bass part -- (the lowest part in polyphonic music)
3. bass, basso -- (an adult male singer with the lowest voice)
4. sea bass, bass -- (the lean flesh of a saltwater fish of the Serranidae family)
5. freshwater bass, bass -- (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus))
6. bass, bass voice, basso -- (the lowest adult male singing voice)
7. bass -- (the member with the lowest range of a family of musical instruments)
8. bass -- (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)

1. The WordNet ontology

classpath (is-a relationship) of sense #7 of bass (the member with the lowest range of a family of musical instruments):

musical instrument
=> instrument
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something

1. The WordNet ontology

EuroWordNet version (not available on-line) for different European languages:
• Spanish
• French
• German
• Dutch
• Czech
• Estonian
• Italian (CNR-Pisa)
• …

MultiWordNet: Italian (IRST-Trento, B. Magnini)

1. The WordNet ontology

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

2. Conceptual Density for WSD

• The problem of ambiguity: a word can have more than just one meaning (polysemy), i.e., it is potentially ambiguous
• A word is disambiguated along with a portion of the text in which it is embedded (its context)...

e.g. This bass has an incredible voice. (A fish does not sing...)

I would love to have a fried bass. (Cannibalism is banned... and frying the musical instrument does not seem a brilliant idea either!)

... and the use of a lexical resource: a dictionary, a thesaurus or... an ontology!

2. Conceptual Density for WSD

• (Automatic) Word Sense Disambiguation (WSD) consists in examining word tokens and specifying exactly which sense of each word is being used, taking into account the context and using an external lexical resource
• Importance of WSD for NLE tasks:
  – Text Categorization, e.g. Category_1: fish markets; Category_2: opera
  – Information Retrieval
  – Question Answering, e.g. Who is the most popular Italian bass singer?
  – Automatic Translation, e.g. Who is the bass singer who eats fried bass?

2. Conceptual Density for WSD

• Senseval competition (last in 2004: Senseval-3; next in 2007: any taker?) http://www.senseval.org
• main Senseval tasks:
  – All-Word Task (AWT): English, Basque…
  – Lexical Sample Task: English, Italian, Spanish, Basque…
  – WordNet Gloss Disambiguation Task: English

2. Conceptual Density for WSD

Measures for evaluation:

• Precision: # of correctly disambiguated words / # of disambiguated words
• Recall: # of correctly disambiguated words / # of words
• Coverage: # of disambiguated words / # of words
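The three measures as a one-function sketch:

```python
def wsd_metrics(correct: int, attempted: int, total: int):
    """Precision, recall and coverage as defined above."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total
    coverage = attempted / total
    return precision, recall, coverage

# e.g. 60 correct out of 80 attempted, 100 words in total
print(wsd_metrics(60, 80, 100))  # (0.75, 0.6, 0.8)
```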

2. Conceptual Density for WSD

WSD approaches:
a. Corpus-based
  – Bayes
  – Hidden Markov Model
  – Support Vector Machine
  – Maximum Entropy
  – Neural Networks
  – Genetic Algorithm
b. Knowledge-based
  – Lesk: # of shared words between the context and the gloss (definition + examples) of the i-th sense of the word to disambiguate
  – Conceptual Density between the synsets of the word to disambiguate and the words of the context
c. Hybrid

2. Conceptual Density for WSD

a. Corpora for corpus-based approaches:
• SemCor (Semantic Concordance) corpus:
  – Brown corpus in SGML format
  – syntactically (POS) and semantically (synsets) tagged
  http://www.cogsci.princeton.edu/~wn/
  e.g. <wf cmd=done pos=NN lemma=muscle wnsn=1 lexsn=1:08:00>
• Senseval corpora: Senseval-1, Senseval-2, Senseval-3
  http://www.senseval.org

2. Conceptual Density for WSD

a. An example of a corpus-based approach: the Bayesian method

• a sliding window is used
• a training phase is needed
• a classifier is learnt

2. Conceptual Density for WSD

e.g. “Singer, electrical guitar and bass players arrived at the concert and bla, bla, bla”

– word to disambiguate: bass
– size of sliding window: 4
– words transformed into their lemmas: guitar|n, and|cjc, player|n, arrive|v
– stopwords not considered: electrical|adj, guitar|n, player|n, arrive|v
– Noun Sense Disambiguation (for the sake of clarity): singer|n, guitar|n, player|n, concert|n

2. Conceptual Density for WSD

w = bass    C = (singer, guitar, player, concert)

w_i: the i-th sense of w;  w_i ∈ S_w, the set of senses of w

  w_max = argmax_i P(w_i|C) = argmax_i P(C|w_i) P(w_i) / P(C)
        ≈ argmax_i P(w_i) ∏_j P(c_j|w_i)

(hypothesis: P(C|w_i) ≈ ∏_j P(c_j|w_i))

Smoothing techniques are used in order to avoid null probabilities
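A minimal sketch of this Bayesian method with add-one smoothing; the training pairs below are illustrative stand-ins for a sense-tagged corpus such as SemCor:

```python
import math
from collections import Counter, defaultdict

def train(tagged_examples):
    """tagged_examples: iterable of (sense, context_words) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in tagged_examples:
        sense_counts[sense] += 1
        for c in context:
            word_counts[sense][c] += 1
            vocab.add(c)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)                 # log P(w_i)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for c in context:                           # + sum_j log P(c_j | w_i)
            score += math.log((word_counts[sense][c] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = sense, score
    return best

examples = [("fish", ["river", "fried", "fish"]),
            ("music", ["singer", "guitar", "player", "concert"])]
model = train(examples)
print(disambiguate(["singer", "guitar", "player", "concert"], *model))  # music
```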

2. Conceptual Density for WSD

b. An example of a knowledge-based approach: the Conceptual Density method

Basque Country University (Agirre and Rigau, 96); improved version, Polytechnic University of Valencia (Buscaldi and Rosso, 04)

Problem: a domain-specific corpus is not always available

Aim: to use the knowledge of an ontology to disambiguate anyway

2. Conceptual Density for WSD

From the conceptual distance between concepts… … to the conceptual density of concepts! (synsets, in the case of the WordNet ontology)

2. Conceptual Density for WSD

the original approach (Agirre & Rigau):

  CD(c, m) = ( Σ_{i=0}^{m-1} nhyp^i ) / descendants_c,    where descendants_c = Σ_{i=0}^{h-1} nhyp^i

c: concept (synset) at the top of the subhierarchy
nhyp: # of hyponyms per node
h: height of the subhierarchy
m: marks (# of senses of the words to disambiguate)
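The same formula as a one-function sketch, assuming nhyp is the mean branching factor of the subhierarchy:

```python
def conceptual_density(nhyp: float, h: int, m: int) -> float:
    """CD(c, m) = sum_{i=0}^{m-1} nhyp**i / sum_{i=0}^{h-1} nhyp**i."""
    marks = sum(nhyp ** i for i in range(m))
    descendants = sum(nhyp ** i for i in range(h))
    return marks / descendants

# e.g. a subhierarchy of height 4 with branching factor 2 and 3 marks
print(conceptual_density(2.0, 4, 3))  # 7/15 ≈ 0.467
```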

2. Conceptual Density for WSD

improved approach (Buscaldi & Rosso): in each subhierarchy, only the relevant synsets along the classpaths are considered

[Diagram: a subhierarchy marking the synsets of the context words and the synsets of the word to be disambiguated.]

2. Conceptual Density for WSD

e.g. “Fulton_County_Grand_jury said Friday an investigation of Atlanta's recent primary_election produced no evidence that any irregularities took_place”

irregularity is the (root form of the) noun to be disambiguated in this case; it has 4 senses:
1. abnormality, irregularity -- (behavior that breaches the rule or etiquette or custom or morality)
2. irregularity, unregularity -- (not characterized by a fixed principle or rate)
3. irregularity (geometrical) -- (an asymmetry in space)
4. constipation, irregularity (medical)

2. Conceptual Density for WSD

[Diagram: the four subhierarchies (clusters 1–4) built for the senses of irregularity; each cluster centroid is the top of a subhierarchy and the nodes are WordNet synset offsets.]

2. Conceptual Density for WSD

Noun Sense Disambiguation algorithm:

1. Select the nouns in the context
   “Brakes howled and a horn blared furiously, but the man would have been hit if Phil hadn’t called out to him a second before” (Senseval-3)
2. build subhierarchies
3. compute densities
4. assign the sense with the highest CD to the noun (when possible)

2. Conceptual Density for WSD

1. We start building subhierarchies by considering the word’s senses and the paths connecting those senses to the root synset
2. Then we find the roots of the subhierarchies: nodes from which only one sense of the word can be reached (see the sketch below)
3. Finally, we add the context words’ paths, if they fall within the subhierarchies

simplification (spanning tree) of the WordNet graph: only the is-a relation is considered

[Diagram legend: synsets representing a sense of the word to disambiguate, roots of subhierarchies, synsets representing a sense of a context word.]
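A minimal sketch of steps 1–2 with NLTK's WordNet interface (requires the WordNet corpus, nltk.download('wordnet'); the spanning-tree simplification is approximated by keeping only each sense's first hypernym path):

```python
from nltk.corpus import wordnet as wn

def subhierarchy_roots(noun: str):
    """For each sense, find the deepest ancestor from which only that
    sense is reachable: the root of its subhierarchy."""
    senses = wn.synsets(noun, pos=wn.NOUN)
    # ancestors of each sense (first hypernym path only)
    paths = {s: set(s.hypernym_paths()[0]) for s in senses}
    roots = {}
    for s in senses:
        others = set().union(*(p for t, p in paths.items() if t != s))
        # nodes on s's path (root -> s) that no other sense passes through
        own = [n for n in s.hypernym_paths()[0] if n not in others]
        roots[s] = own[0] if own else s
    return roots

for sense, root in subhierarchy_roots("brake").items():
    print(sense.name(), "->", root.name())
```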

2. Conceptual Density for WSD

M: number of relevant synsets falling in one of the subhierarchies
h: the height of the subhierarchy
nh: total number of synsets in the subhierarchy
f: frequency of the synset in WordNet (based on SemCor)

  CD(M, nh, f) = M^0.10 · (M / nh)^(log f)

2. Conceptual Density for WSD

e.g. brake (4 senses) with context words: {horn, man, second}

Subhierarchy for sense 1: M=9, nh=21, CD = 9^0.10 · (9/21)^(log 1) = 1.25
Subhierarchy for sense 2: M=1, nh=1, CD = 1
Subhierarchy for sense 3: M=1, nh=1, CD = 1
Subhierarchy for sense 4: M=1, nh=5, CD = 1^0.10 · (1/5)^(log 4) = 0.07
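A sketch of the density as reconstructed above; the log base is an assumption (a natural log comes closest to the slide's sense-4 value):

```python
import math

def cd(m: int, nh: int, f: int, alpha: float = 0.10) -> float:
    """CD(M, nh, f) = M**alpha * (M/nh)**log(f)."""
    return m ** alpha * (m / nh) ** math.log(f)

print(cd(9, 21, 1))  # sense 1 of "brake": ~1.25
print(cd(1, 5, 4))   # sense 4: ~0.11 (the slide reports 0.07)
```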

2. Conceptual Density for WSD

Adjective Sense Disambiguation: Mutual Domain Weights

• no Conceptual Density
• WordNet Domains and (WordNet) frequency
• context of only one word: the noun related to the adjective (e.g. academic course)

for each sense of the adjective w and each sense of the context noun c, where f and i index the senses (ranked by WordNet frequency):

  MDW(w_f, c_i) = 0                   if Dom(w_f) ≠ Dom(c_i)
  MDW(w_f, c_i) = 1/f + 1/i           if Dom(w_f) = Dom(c_i) ≠ Factotum
  MDW(w_f, c_i) = (1/f + 1/i) / 10    if Dom(w_f) = Dom(c_i) = Factotum
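A sketch of the reconstructed MDW, assuming f and i are the 1-based frequency ranks of the two senses:

```python
FACTOTUM = "Factotum"

def mdw(dom_w: str, f: int, dom_c: str, i: int) -> float:
    """Mutual Domain Weight between adjective sense f and noun sense i."""
    if dom_w != dom_c:
        return 0.0
    weight = 1.0 / f + 1.0 / i
    return weight / 10 if dom_w == FACTOTUM else weight

print(mdw("Pedagogy", 1, "Pedagogy", 2))   # matching domain: 1.5
print(mdw("Factotum", 1, "Factotum", 1))   # damped Factotum match: 0.2
```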

2. Conceptual Density for WSD

Adjective Sense Disambiguation: Mutual Domain Weights

additional MDWs, taking into account the adjectives and nouns reachable through the lexical relations:
• similar sense (adj): e.g. far – distant
• antonym (adj): e.g. small – big
• pertainym (adj–noun): e.g. electrical – electricity
• attribute (adj–noun): e.g. small – size

2. Conceptual Density for WSD

Adjective Sense Disambiguation: Mutual Domain Weights

c: context noun
S: vector of synsets related to the f-th sense of the adjective
k: polysemic grade of c (# of senses of the context noun)
z: # of non-null MDWs

  SW(w_f, c, S) = (1/z) · Σ_{i=0}^{k} ( MDW(w_f, c_i) + Σ_{j=0}^{|S|} MDW(s_j, c_i) )

2. Conceptual Density for WSD

Verb and Adverb Sense Disambiguation: Mutual Domain Weights

• no CD for verbs:
  – hierarchy too shallow (only 3 levels)
  – usually only one verb per sentence
• MDWs between the word sense and the senses of the context words
  – verbs in the AWT: noun preceding and noun following the verb
  – other tasks: 4 words of any POS
• no MDWs from related synsets

2. Conceptual Density for WSD

Results over the SemCor nouns:
• Precision: 81.5% (baseline Most Frequent Sense (MFS): 75.5%)
• Recall: 59.9%

Results over the SemCor adjectives:
• Precision: 72.8% (baseline MFS: 79.4%)
• Recall: 56.5%

2. Conceptual Density for WSD

Advantages and drawbacks:
• Pro:
  – fast, automatic
  – it distinguishes easy cases from difficult ones
• Contra:
  – low recall
  – higher mistake probability in difficult cases (when we try to add extra weights)
  – lower precision when recall improves

2. Conceptual Density for WSD

All-Word Task (English): Senseval-3 competition:
• Difficulty of the AWT: inter-annotator (dis)agreement 72.5%
• Best system (corpus-based): P = R = 65%
• CIAOSENSO (our system): P = 60%, R = 48%
• R2D2 (4th-best system): CIAOSENSO + other corpus-based and knowledge-based systems

2. Conceptual Density for WSD

All-Word Task (English): Senseval-3 competition:
• Nouns: P = 74.3% (baseline MFS: 69.8%), R = 51.2%
• Adjectives: P = 72.8% (baseline MFS: 79.4%), R = 56.5%

2. Conceptual Density for WSD

More Senseval-3 tasks:
• Lexical Sample Task: hybrid (partially supervised) system (the training corpus is used to change the ranking of sense frequency)
• Gloss Disambiguation Task: additional weights were added when the following relationships were found:
  – hypernyms and hyponyms of the head synset
  – meronyms and holonyms
  – pertainyms and attributes
  – WordNet Domain correspondence with the head synset (e.g. Medicine for blood(2) in the definition of heart(1))

2. Conceptual Density for WSD

Problem 1: low recall using the Conceptual Density

Attempts to increase the recall (nouns) using:
a. WordNet Domains
b. Density depth correction
c. Specific context correction
d. Context expansion with:
   I. the (definition part of the) gloss
   II. the global context of the document
e. Glosses of CALD (integrated into the WN glosses)
f. the Web as a lexical resource for WSD

Problem 2: poor verb sense disambiguation

Attempt using: Support Vector Machines (SVM)

2. Conceptual Density for WSD

1a. WordNet Domains
Not of great help, because of the Factotum category problem

1b. Density depth correction
More weight is given to a subhierarchy placed in a deeper position in the ontology:
  CD * (depth(subhierarchy) - avgdepth + 1)

1c. Specific context correction
More weight is given to a subhierarchy if a context word sense is a hyponym (more specific sense) of a sense of the word to disambiguate

(small improvements of the recall, but deterioration of the precision)

2. Conceptual Density for WSD

1d. Context expansion

I. the (definition part of the) POS-tagged gloss
II. the Global Context (GC) of the document: extraction of keywords from the document using frequency and distribution: high frequency and standard deviation (Lee et al. 04)

                 P        R        C
  CIAOSENSO      74.3%    49.7%    66.9%
  CIAOSENSO+GC   73.4%    50.8%    69.2%

2. Conceptual Density for WSD

1e. Glosses of CALD (integrated into the WN glosses)

• if more than 40% of the definition part of the WN gloss is found in one of the CALD definition parts of candidate glosses (~ Lesk knowledge-based approach): the sample part of CALD is added to the WN gloss
  1) we search the CALD Web pages for: coherence, …
  2) we calculate the matching % of the definition part
  3) if at least 40%, we add its sample part
• e.g. WN synset: coherence, coherency, cohesion, cohesiveness (the state of cohering or sticking together)

2. Conceptual Density for WSD

1e. Glosses of CALD (integrated into the WN glosses)

• WN 2.0: 8195 samples + 7416 new CALD samples
• SemCor: with CALD samples P = 79.8%, R = 59.76% (without: P = 81.5%, R = 59.9%)
• Senseval-3 (AWT corpus)
  - without CALD samples: P = 74.3%, R = 51.2%
  - with:

          P         R
  GW      73.75%    52.14%
  GWd     73.98%    51.81%
  GWs     74.06%    52.03%

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

• knowledge acquisition bottleneck (sample size is too small) for WSD
• Web redundancy to disambiguate nouns using modifier adjectives (web hits)

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

Preliminary definitions:
  w: word, with |w| senses
  a: adjective
  n: # of synonyms s_ik of w_k
  m: # of words h_jk in the direct hypernym synset of w_k
  f_S(x, y): function returning the # of pages containing “x y” (according to the search engine S)
  f_S(x): function returning the # of pages containing x

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

Web-based algorithm: adjective–noun lexical patterns

1. Select the adjective a before w
2. For each sense w_k, synonym s_ik and hypernym (or hyponym) h_jk, compute f_S(a, s_ik) and f_S(a, h_jk)
3. Assign a weight to each w_k (combining the results of 2.) using a given formula F
4. Select the w_k with the highest weight

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

e.g. Senseval-3: “A faint crease appeared between the man’s eyebrows”

crease_1 = {fold, crease, bend, …}
crease_2 = {wrinkle, crease, line, …}
crease_3 = {kris, crease, creese}

hypernyms:
h_1 = {angular shape, angularity}
h_2 = {depression, impression, imprint}
h_3 = {dagger, sticker}

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

sense 1: (faint, fold), (faint, bend), … (faint, angular shape), (faint, angularity)
sense 2: (faint, wrinkle), (faint, line), … (faint, depression), (faint, impression), (faint, imprint)
sense 3: (faint, kris), (faint, creese), (faint, dagger), (faint, sticker)

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

Some of the tested formulae, based on:
• weight average:
  F_I = 1/2 · ( Σ_i f_S(a, s_ik) / n + Σ_j f_S(a, h_jk) / m )
  F_II: F_I with hyponyms
• weight maximum:
  F_III = max( f_S(a, s_ik), f_S(a, h_jk) )
• similarity measures:
  F_IV = max( f_S(a, s_ik) · log( f_S(a, s_ik) / f_S(s_ik) ), f_S(a, h_jk) · log( f_S(a, h_jk) / f_S(h_jk) ) )
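A sketch of formula F_IV as transcribed above. hits() is a hypothetical stand-in for the page-count function f_S of a real search engine (or an offline index such as Lucene, as in the comparison below); plug in a real lookup to use it:

```python
import math

def hits(*terms: str) -> int:
    raise NotImplementedError("replace with a real page-count lookup")

def f4_weight(adj: str, related: list[str]) -> float:
    """F_IV: max over synonyms/hypernyms x of f(adj, x) * log(f(adj, x) / f(x))."""
    scores = []
    for x in related:
        pair, alone = hits(adj, x), hits(x)
        if pair and alone:
            scores.append(pair * math.log(pair / alone))
    return max(scores, default=0.0)

# e.g. weight sense 2 of "crease" for the pattern "faint crease":
# f4_weight("faint", ["wrinkle", "line", "depression", "impression", "imprint"])
```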

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

  Formula   Precision   Recall   Coverage   Prec_na
  MFS       0.689       0.689    100%       0.623
  I         0.627       0.271    43.3%      0.318
  II        0.661       0.286    43.3%      0.392
  III       0.660       0.278    42.0%      0.373
  IV        0.777       0.337    43.3%      0.634

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

Search Engine (SE) comparison:
• MSN, AltaVista + Lucene with the TREC-8 document collection
• no significant differences between the Web-based SEs
• Lucene (offline): + precision, - coverage; less data, but of better quality

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

• in general, better performance in precision (and recall: 4% gain) of the frequency-corrected formulae (with probabilities over SemCor)
• importance of the polysemy of adjectives for noun sense disambiguation: the less polysemic the adjective, the higher the probability of selecting the right sense
• same approach for the disambiguation of adjectives (searching for f_S(a_ik, w)): poor precision, 21.3%

2. Conceptual Density for WSD

1f. The Web as a lexical resource for WSD

• in many cases just one adjective is not enough to understand the meaning of a noun (e.g. pairs still ambiguous: cold fire, cold passion): a greater context should be taken into account
• better to integrate Web-based approaches than to use them standalone

2. Conceptual Density for WSD

2. Verb sense disambiguation with SVM

• Problems:
  – high polysemy of verbs compared with the other categories
  – lack of lexical resources providing relations between verbs and nouns
• Standard features:
  – Word Feature: the lexical form of each word
  – POS Feature: the Part-Of-Speech tag of each word in the same window
  – Word.POS Feature: the concatenation of the previous features

e.g.: Reid saw me looking at the iron bars.
  WF: Reid, saw, me; at, the, iron
  PF: NNP, VBD, PRP; IN, DT, NN
  WP: Reid.NNP, saw.VBD, me.PRP; at.IN, the.DT, iron.NN
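A sketch of the WF/PF/WP extraction for the example above, assuming a POS-tagged sentence and a ±3-word window around the target verb:

```python
def svm_features(tagged, target_idx, window=3):
    """Return (WF, PF, WP) lists for the words around the target."""
    left = tagged[max(0, target_idx - window):target_idx]
    right = tagged[target_idx + 1:target_idx + 1 + window]
    ctx = left + right
    wf = [w for w, p in ctx]                 # Word Features
    pf = [p for w, p in ctx]                 # POS Features
    wp = [f"{w}.{p}" for w, p in ctx]        # Word.POS Features
    return wf, pf, wp

sent = [("Reid", "NNP"), ("saw", "VBD"), ("me", "PRP"), ("looking", "VBG"),
        ("at", "IN"), ("the", "DT"), ("iron", "NN"), ("bars", "NNS")]
print(svm_features(sent, sent.index(("looking", "VBG"))))
```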

2. Conceptual Density for WSD

2. Verb sense disambiguation with SVM

L-hypernymy feature: the hypernyms extracted from WordNet, for each noun in the context, at depth L, where L is the number of levels to go up in the WordNet hierarchy (see the sketch below)

Sense 1: iron, Fe => metallic element, metal => chemical element, element => substance, matter => entity

Sense 2: iron => golf club, golf-club, club => golf equipment => sports equipment => equipment => instrumentality ... => entity
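A sketch of the L-hypernymy feature with NLTK's WordNet, climbing L levels up from each sense of a context noun (following only the first hypernym of each synset, a spanning-tree simplification):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def l_hypernyms(noun: str, L: int):
    """Lemmas of the ancestors L levels above each sense of the noun."""
    feats = set()
    for sense in wn.synsets(noun, pos=wn.NOUN):
        syn = sense
        for _ in range(L):
            hyps = syn.hypernyms()
            if not hyps:
                break
            syn = hyps[0]
        feats.update(lemma.name() for lemma in syn.lemmas())
    return feats

print(l_hypernyms("iron", 1))  # e.g. metallic_element, golf_club, ...
```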

2. Conceptual Density for WSD

2. Verb sense disambiguation with SVM

Experiments
• avg. # of training samples per verb: 123.53
• avg. # of test samples per verb: 61.81
• implementation of SVM: SVM_light (http://svmlight.joachims.org/) by Thorsten Joachims
• a model was built for every verb of the corpus, using the training set from the Lexical Sample corpus
• 1 ≤ L < 8

2. Conceptual Density for WSD

2. Verb sense disambiguation with SVM

[Chart: accuracy (0–60%) for no L-hypernymy and L = 1…7, with window sizes 4 and 5, against the MF baseline and a window-size-4 sense-limited variant.]

2. Conceptual Density for WSD

2. Verb sense disambiguation with SVM

• WordNet-extracted features did not prove useful
• better results could be achieved if only the hypernyms of the right sense of the context nouns were considered: Noun Sense Disambiguation is needed

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

3. Semantic (Geo) IR & TC

Study of the impact of semantics in the tasks:

a. Retrieval of XML documents (not strictly a NLE task)
b. Information Retrieval
c. Geo IR
d. Text Categorization

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

• highly dynamic nature of the Web + increased quantity of information represented as XML documents
• need for evaluating approximate queries over XML documents: approximation = documents are returned even if they do not “completely” meet the constraints the query imposes
• returned documents are ranked relying on a measure evaluating the similarity degree

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

• tree representation of the query and of a document (a possible answer to the query)
• algorithm matching the two tree structures and returning the similarity degree
• vocabulary approximation: query tags not “syntactically” matching document tags even if expressing the same concept

e.g. movie vs. film (but film NOT as photographic film)
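A minimal sketch of this vocabulary approximation with NLTK: two tags are considered equivalent when they share at least one WordNet synset, which also exposes the ambiguity noted above (film as photographic film):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def tags_match(query_tag: str, doc_tag: str) -> bool:
    """True if the two tags share a noun synset (e.g. movie/film)."""
    q = set(wn.synsets(query_tag, pos=wn.NOUN))
    d = set(wn.synsets(doc_tag, pos=wn.NOUN))
    return bool(q & d)

print(tags_match("movie", "film"))  # True: they share the movie synset
print(tags_match("movie", "map"))   # False
```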


3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

• automatically assigning a meaning to the query tags
• disambiguation of a query tag considering the neighbour tags (its context): father, children, brothers

e.g.1. query tag = film
a context with director helps to disambiguate film:
  (director, manager, managing director)
  (stage director, director)
  (conductor, music director)

e.g.2. query tag = map, context tags = (water, mountain, sea)

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

[Diagram: WordNet subhierarchies for the senses of map with the context tags (water, mountain, sea): Map(1) “representation” under Top > Entity > Object > Artifact > Creation, and Map(2) “function” under abstraction > Relation > Mathematical Relation; context senses such as Water(1…7), Sea(1…3) and Mountain fall in the object subhierarchy. Densities: CDe(2,9,2) = 0.42, CDe(10,28,1) = 1.77.]

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

• query: /map[water OR sea = "Mediterranean"] AND /map[mountain = "M.Blanc"]
• tree representation + disambiguation:

[Diagram: the query as a tree (AND over map, with an OR branch water/sea = Mediterranean and a branch mountain = M.Blanc), annotated with the disambiguated synsets: map → “representation”, water → “liquid”, sea → “body of water, water”, mountain → “natural elevation, elevation”.]

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

similarity degrees: (1) > (0.92) > (0.83)

[Diagram: the disambiguated query tree matched against three document trees: map{sea: Mediterranean, mountain: M.Blanc, water}; representation{sea: Mediterranean, mountain: M.Blanc}; movie{mountain{name: M.Blanc, altitude: 4807}, Garda Lake}.]

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

[Diagram: the same matching with threshold = 0.90: the documents with similarity degrees 1 and 0.92 are returned, while the movie document (0.83) is filtered out.]

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

(Mesiti, Guerrini, Buscaldi, Rosso, 03)
• preliminary experiments: 30 Web docs (~600 distinct tags)
• the NLE-oriented WordNet ontology is not really well-suited for disambiguating XML docs: 30% of the tags are not contained in WN
  – combinations of nouns (e.g. productList, clubname)
  – unintelligible abbreviations (e.g. msrb, cnames)
  – verbs
  – stoplist words
• similar results with corpus-based (Bayes) and knowledge-based (Conceptual Density) approaches
  – 40% of the remaining tags were disambiguated correctly
  – the CD approach is faster
• new relationships among tags should be considered, e.g. “is-used-in-the-context-of”

3. Semantic (Geo) IR & TC

a. Retrieval of XML documents (not strictly a NLE task)

Gerhard Weikum, “Efficient Top-k Queries for XML IR”, Workshop “The Future of Web Search”, Barcelona, May 19-20, 2006
• structure + content + semantics
• semantic search: semantic inference from the gathered docs
• relevance score based on ontology similarity of concept names
• glosses of WordNet
• expansion with extracted info (mutual info?)
• towards a statistical Semantic Web

3. Semantic (Geo) IR & TC

b. Information Retrieval

Semantic indexing (with WordNet synsets):
• if WSD with Precision >= 90% (Sanderson, 94)
• an “adding semantic indexing” approach (not an “instead of” approach): query expansion (Krovetz, 96)
  e.g. “What financial institutions are found along the banks of the Nile?” → “(financial_institution OR 6003131/N) AND (bank OR 6800223/N) AND (Nile OR 6826174/N)”
• WSD with Precision >= 70% → 29% increase of the IR Recall: experiments over SemCor (Gonzalo, 98)

3. Semantic (Geo) IR & TC

b. Information Retrieval

Semantic IR experiments:

b1. Latent Semantic Indexing (LSI) and k-means (Jiménez, Vidal, Rosso, 03)

b2. Conceptual clusters using the synonymy, hypernymy, hyponymy and meronymy lexical relationships (Kang et al. 04)…

3. Semantic (Geo) IR & TC

b1. LSI + k-means IR
• Time Magazine collection (~500 docs)
• comparison of semantic lemmatisation vs. stemming pre-processing: precision 20% lower!

3. Semantic (Geo) IR & TC

b1. LSI + k-means IR

Why, if (Gonzalo, 98) talks about an increase of recall in IR of up to 29%?

Unfortunately, the error rate of state-of-the-art WSD systems is greater than 30%: 65% precision (and recall) of the best system in the AWT of Senseval-3

Attempt in the future to include semantic indexing only for relevant words: selection of the relevant words first

3. Semantic (Geo) IR & TC

b2. IR with conceptual clusters

• Aim: to consider the semantic importance of the words and their concepts in a doc
• Conceptual clusters built with k lexical relations (“identity”, synonymy, hypernymy, hyponymy, meronymy), each with a different weight (Kang et al. 04)

  weight of word w_i = Σ_k (# relations of kind k) · weight_relation_k
  weight of conceptual cluster C_l = Σ_i w_i

3. Semantic (Geo) IR & TC

b2. IR with conceptual clusters

A conceptual cluster C_l is representative if its weight is >= the average of the weights of all the conceptual clusters

e.g. Cluster1 (3.4) = {w1 (1.7), w2 (0.7), w3 (0.5), w4 (0.5)}, where
  w1 = 1 · 0.7 + 2 · 0.5 = 1.7 (one identity relation, two synonymy relations)
  C1 = 1.7 + 0.7 + 0.5 + 0.5 = 3.4
(see the sketch below)
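A sketch of the weighting in the example; the per-relation weights (identity 0.7, synonymy 0.5) are read off the example, and the remaining relation kinds would get weights of their own:

```python
RELATION_WEIGHTS = {"identity": 0.7, "synonymy": 0.5}

def word_weight(relation_counts: dict) -> float:
    """weight of w_i = sum over kinds k of (# relations of kind k) * weight_k."""
    return sum(n * RELATION_WEIGHTS[k] for k, n in relation_counts.items())

w1 = word_weight({"identity": 1, "synonymy": 2})  # 1 * 0.7 + 2 * 0.5 = 1.7
cluster1 = w1 + 0.7 + 0.5 + 0.5                   # C1 = 3.4
print(w1, cluster1)
```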

3. Semantic (Geo)3. Semantic (Geo) IR & TC IR & TC

b2. IR with conceptual clustersb2. IR with conceptual clusters““This This exerciseexercise routineroutine, developed by Steve, , developed by Steve, directordirector of rehabilitation at the of rehabilitation at the Centre for Centre for SpineSpine, and , and administratoradministrator of the Health Association, focuses on of the Health Association, focuses onthe right spots: It strengthens your the right spots: It strengthens your backback musclemuscle, abdominals, and obliques , abdominals, and obliques (the ab (the ab musclesmuscles that run from front to back along your lower ribs) and that run from front to back along your lower ribs) and stretches your legs and hips. Combine this stretches your legs and hips. Combine this practicepractice with three 30-minutes with three 30-minutes sessions of cardiovascular sessions of cardiovascular activityactivity such as such as joggingjogging, and you should be on , and you should be on your way to a healthier your way to a healthier backback””

Resulting conceptual clusters:
C1 (2.4): back(0.7), spine(0.5), back(1.2)
C2 (1.4): muscle(0.7), muscle(0.7)
C3 (0.6): director(0.3), administrator(0.3)
C4 (3.0): exercise(0.6), practice(0.3), activity(1.2), routine(0.3), jogging(0.6)

The words practice and director have the same weight (0.3), BUT practice is semantically more important: the importance of w_i is measured w.r.t. its representative cluster C_l (practice belongs to C4, weight 3.0, above the average cluster weight of 1.85, whereas director's cluster C3, weight 0.6, is not representative)

3. Semantic (Geo) IR & TC b2. IR with conceptual clusters

Comparison with TF*IDF (TREC-2 collection):
• semantic indexes: words with a greater importance than the average importance of all the words
• increase of precision of 10%
• reduction of the size of the indexed terms of 80%!

3. Semantic (Geo) IR & TC c. Geo IR

Geo IR could be ambiguous:

• (Geo I)R: retrieval of information involving some kind of spatial awareness (Fred Gey @ GeoCLEF 2005)
E.g. Find news about riots in France.

• Geo (IR): not a particular aspect of Spatial Information Retrieval
E.g. What is the river flowing through Paris?

3. Semantic (Geo) IR & TC c. Geo IR

(Almost) the same Geographical Entity can be indicated in several different (and sometimes ambiguous) manners:

• United Kingdom of Great Britain and Northern Ireland
• United Kingdom, UK, U.K. + Northern Ireland
• Great Britain, GB + Northern Ireland
• Reino Unido, Gran Bretagna
• British Isles

3. Semantic (Geo) IR & TC c. Geo IR
Geographical Conceptual Networks in WordNet

[Diagram: WordNet holonym/meronym network linking British Isles, Great Britain, UK, England, Scotland, Wales, N. Ireland, Ireland (Hibernia) and Ireland (Eire)]

3. Semantic (Geo) IR & TC c. Geo IR
Exploiting WordNet

• Solving synonymy:
E.g. the synset corresponding to "U.K.":
{United Kingdom, UK, U.K., Great Britain, GB, Britain, United Kingdom of Great Britain and Northern Ireland}
• Finding missing (geographical) information:
– Meronymy ("has member/part" relationship)
– Holonymy ("is member/part of")

Two solutions tested:
• Query Expansion (QE)
• Index Terms Expansion (ITE)

3. Semantic (Geo) IR & TC c. Geo IR: QE

Expand the geographical terms of the query with their synonyms and (some) meronyms
• geographical terms are identified through the WordNet ontology: words having the synset {region, location} among their hypernyms
• meronyms containing the word "capital" in the definition (gloss) or in the meronym synset itself
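These heuristics can be sketched with NLTK's WordNet interface (an assumption for illustration only; the original experiments did not use NLTK):

# Sketch of the QE heuristics: geo terms have {region, location} among
# their hypernyms; expansions are synonyms plus "capital" meronyms.
from nltk.corpus import wordnet as wn

def is_geographical(synset):
    hypernyms = synset.closure(lambda s: s.hypernyms())
    return any(h.name().split(".")[0] in ("location", "region") for h in hypernyms)

def expand_geo_term(term):
    expansions = set()
    for syn in wn.synsets(term, pos=wn.NOUN):
        if not is_geographical(syn):
            continue
        expansions.update(l.name().replace("_", " ") for l in syn.lemmas())
        for mero in syn.part_meronyms() + syn.member_meronyms():
            names = [l.name().replace("_", " ") for l in mero.lemmas()]
            if "capital" in mero.definition() or any("capital" in n for n in names):
                expansions.update(names)
    return expansions

print(expand_geo_term("Germany"))  # e.g. Deutschland, FRG, Berlin, Bonn, ...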

3. Semantic (Geo) IR & TC c. Geo IR: QE

e.g. Foreign minorities in Germany
• "Germany" appears in the synset: {Germany, Federal Republic of Germany, Deutschland, FRG}
• The following meronyms contain the word "capital":
Berlin, german capital
Bonn (was the capital of Germany between 1949 and 1989)
Munich, Muenchen (capital of Bavaria)
Aachen, Aken, Aix-la-Chapelle (formerly Charlemagne's northern capital)

3. Semantic (Geo) IR & TC c. Geo IR: ITE

• Find geographical terms in the text collection: openNLP Named Entities detector (http://opennlp.sourceforge.net)
• Put all their holonyms and synonyms into a special geo index
Search engine used: Lucene (http://lucene.jakarta.org)
• Label geographical terms in the query with the geo search field:
e.g. "riots in France" -> text:riots geo:France
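The indexing side can be sketched as follows (a stand-in: in the real system the geo field lives inside a Lucene index, and the named entities come from openNLP):

# ITE sketch: each geographical NE contributes its synonyms and holonyms
# to a separate "geo" field of the index.
from nltk.corpus import wordnet as wn

def geo_field_terms(named_entity):
    terms = set()
    for syn in wn.synsets(named_entity, pos=wn.NOUN):
        terms.update(l.name().replace("_", " ") for l in syn.lemmas())
        for holo in syn.part_holonyms() + syn.member_holonyms():
            terms.update(l.name().replace("_", " ") for l in holo.lemmas())
    return terms

print(geo_field_terms("Paris"))  # includes France; closing over holonyms
                                 # transitively would also add Europe, ...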

3. Semantic (Geo) IR & TC c. Geo IR: ITE

"On Sunday mornings, the covered market opposite the station in the leafy suburb of Aulnay-sous-Bois - barely half an hour's drive from central Paris - spills opulently on to the streets and boulevards."

From WordNet:
Paris, French capital, capital of France, city of light
France, French Republic
Europe
Northern hemisphere

WordNet expansions -> geographical index; the document terms -> standard index

3. Semantic (Geo) IR & TC c. Geo IR
Experiment setup

• GeoCLEF 2005 collection and queries
Los Angeles Times 1994
Glasgow Herald 1995
• Topic Description (TD) runs:
e.g. TD from queries:
"Shark attacks near California and Australia"
"Vegetable exporters of Europe"
"Holidays in the Scottish Trossachs"
• 1000 results returned for each query

3. Semantic (Geo) IR & TC c. Geo IR
Results QE

[Chart: precision (0%-100%) at recall levels 0-10, Clean system vs. system with QE]

3. Semantic (Geo) IR & TC c. Geo IR
Results QE

Why did it perform so badly? Two major errors:
• Inconsistent expansions
e.g. "Sacramento" expanding California in the query "Shark attacks in California"
• Ambiguity
e.g. "Europe" in "Vegetable exporters of Europe": WordNet returns three senses for "Europe":
Europe as a continent
Europe as the European Union
Europe as the set of nations on the European continent

3. Semantic (Geo) IR & TC c. Geo IR
Results ITE

[Chart: precision (0%-100%) at recall levels 0-10, Clean system vs. system with ITE]

3. Semantic (Geo) IR & TC c. Geo IR
Remarks

• ITE better than QE
– it seems to be less sensitive to ambiguity problems
– however: it needs NE recognition during the indexing phase (not trivial)
• The WordNet ontology can be used for Geo IR:
– to be evaluated against a specialized resource like the TGN http://www.getty.edu/research/conducting_research/vocabularies/tgn/

3. Semantic (Geo) IR & TC d. TC

• Text Categorisation: labelling natural language docs with thematic categories from a predefined set
• a classifier learns from a training set the correspondence between docs and categories:
– the k-NN method finds, for a test doc, its k Nearest Neighbours among the training docs
– the categories of the k neighbours are used to select the nearest category for the test doc

3. Semantic (Geo) IR & TC d. TC
• Vector Space Model (Salton)
• TF*IDF weighting scheme

• text codification: the i-th component of d is d_i = T_{d,i} = TF_{d,i} · IDF_i · NORM, where each letter of a codification (e.g. ltc) picks one factor:

TF factor:
n (none): TF_{d,i}
b (binary): 0/1
m (max norm): TF_{d,i} / max_i(TF_{d,i})
a (augmented): 0.5 + 0.5 · TF_{d,i} / max_i(TF_{d,i})
l (log): 1 + log(TF_{d,i})

IDF factor:
n (none): 1
t: log(N / DF_i)

NORM factor:
n (none): 1
c (cosine): 1 / sqrt(Σ_i (TF_{d,i} · IDF_i)²)
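A small sketch of one of these codifications, "ltc" (log TF, t idf, cosine normalization), following the table above:

# Sketch: "ltc" codification of a document.
import math

def ltc_vector(tf, df, n_docs):
    """tf: {term: raw frequency in the doc}; df: {term: document frequency}."""
    raw = {t: (1 + math.log(f)) * math.log(n_docs / df[t])
           for t, f in tf.items() if f > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # cosine normalization
    return {t: w / norm for t, w in raw.items()}

print(ltc_vector({"ontology": 3, "wordnet": 1}, {"ontology": 5, "wordnet": 20}, 100))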

3. Semantic (Geo) IR & TC d. TC

• Hidden Markov Model corpus-based WSD system (Molina and Pla, 02)

• term selection process
– to optimise the list of terms that identify the collection (optimisation focused on reducing the number of terms with poor information)
– Information Gain (IG) method: IG measures the amount of information that a term contributes to the prediction of a category (as a function of its presence or absence in a given text)

3. Semantic (Geo) IR & TC d. TC
Experiments on data sets (Ferretti, Errecalde, Rosso, 05):

• 20 Newsgroups
– ~20,000 news messages (sent in 1993) from 20 Usenet discussion groups (the categories)
– 1000 documents per category, except for the soc.religion.christian category, which contains 997
• WebKB
– ~5,000 Computer Science web pages of universities (Cornell, Texas, Washington, Wisconsin and miscellaneous)
– pages divided into 7 categories: student, faculty, staff, course, project, department and other
• Reuters-21578
– ~1,600 newswire docs distributed in 22 SGML format files
– 5 different sets of content-related economic categories

3. Semantic (Geo) IR & TC d. TC

[Chart: error percentage per codification (atc, btc, ltc, anc, bnc, lnc, ltn, mnn, mtn) on 20 Newsgroups, WebKB and Reuters; series: Terms, Synsets, Terms + Synsets]

3. Semantic (Geo) IR & TC d. TC

Complete vocabulary (20 Newsgroups): 31,786 terms + 27,652 synsets

[Chart: error percentage vs. vocabulary size (50 to 25,600); series: Terms, Synsets, Terms + Synsets]

3. Semantic (Geo) IR & TC d. TC

Complete vocabulary (WebKB): 17,251 terms + 13,475 synsets

[Chart: error percentage vs. vocabulary size (50 to 12,800); series: Terms, Synsets, Terms + Synsets]

3. Semantic (Geo) IR & TC d. TC

Complete vocabulary (Reuters): 28,969 terms + 14,580 synsets

[Chart: error percentage vs. vocabulary size (50 to 25,600); series: Terms, Synsets, Terms + Synsets]

3. Semantic (Geo) IR & TC d. TC

• the impact of the use of semantic information depends on the particular characteristics of the corpus
• in corpora that are richer syntactically and semantically, the inclusion of semantic information allows a small improvement if vocabularies with a sufficient number of features are considered (50% of the original size)
• attempt in the future to include semantic indexing only for relevant words: selection of relevant words first

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

4. Semantic clusters of very short narrow-domain texts

Problems
• organization of a text set => data structuring
• searching for interesting texts => clustering-based navigation

Typical situation
• free access to full-text scientific papers is limited to only their abstracts, consisting of no more than several dozen words
• sometimes the set of full-text scientific papers on a given domain is not available at all and a library has only abstracts

Typical opinion: usual keyword-based methods work well

4. Semantic clusters of very short narrow-domain texts

Very short texts
• texts from different domains: Society, Culture, Economics, Politics, … => no intersection of vocabularies
• texts from narrower domains: Sciences (Physics, Chemistry, Biology, …) => weak intersection of vocabularies
• texts from narrow domains: Physics (Nuclear physics, Experimental physics, Optical physics, …) => strong intersection of vocabularies

Problem: the stronger the vocabulary intersection is, the more unstable the results are

Very short texts
• news and other self-contained texts
• abstracts of full scientific texts or technical papers

Abstracts explain the goals of the research reported in the paper (the problem), while papers explain the methods used to achieve these goals (i.e., the algorithms)

Our goal is to shorten the gap between:
1. automatic abstract clustering vs. manual abstract clustering
2. automatic abstract clustering vs. manual paper clustering

Problem: imprecise results when clustering abstracts

4. Semantic clusters of very short narrow-domain texts

Very short texts (50-100 words)
• absolute frequencies of indexes are sometimes 3-4, generally 0-2
• only 5%-15% of the vocabulary is used in every text

4. Semantic clusters of very short narrow-domain texts

Traditional approach

1. Constructing the word frequency list
stop-words are eliminated
words having the same base meaning are joined (stemming)

2. Constructing text images according to TF or TF-IDF techniques
tf_{i,j} = f_{i,j} / max f_{i,j}    idf_i = log(N / n_i)    (i-th word, j-th text)

3. Clustering using the cosine measure
From (2): high randomness in text images

Results: not such a big problem when texts are from different domains, but when they are from a narrow domain…

4. Semantic clusters of very short narrow-domain texts

Struggling for stability: using a compensative effect
• join indexes (keywords): (w1, w2, …, wn) => W1 = (w1, w3, w19), W2 = (w7, w13, w23), …
• cluster abstracts in the new index space (cluster coordinates): (W1, W2, …)

Selection of groups of indexes
1. Use synsets of an appropriate ontology
2. Use a thesaurus of a given domain
3. Cluster the words in the space of texts
4. Use the transition point technique
5. Conceptual clusters (using WordNet synsets)

4. Semantic clusters of very short narrow-domain texts

Selection of groups of indexes

3. Cluster the words in the space of texts: MajorClust algorithm

Weighting indexes: W_k = Σ d_{i,j} / N_k, where
k is the number of the cluster,
i and j are elements of this cluster (i ≠ j),
N_k is the number of links in cluster k

(Alexandrov, Gelbukh, Rosso, 05)

4. Semantic clusters of very short narrow-domain texts

An object belongs to the cluster to which the majority of its neighbours belong.

Sub-optimal solution: only a limited part of the neighbours is considered.
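A minimal sketch of the MajorClust idea just described (a simplification: the published algorithm works on a similarity graph and re-labels nodes by the weighted majority of their neighbours until stable):

# MajorClust sketch: start with one cluster per object, then move every
# object to the cluster holding the weighted majority of its neighbours.
def majorclust(similarity, threshold=0.0, max_iters=100):
    n = len(similarity)
    labels = list(range(n))                 # one cluster per object
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            votes = {}
            for j in range(n):
                if i != j and similarity[i][j] > threshold:
                    votes[labels[j]] = votes.get(labels[j], 0.0) + similarity[i][j]
            if votes:
                best = max(votes, key=votes.get)
                if best != labels[i]:
                    labels[i], changed = best, True
        if not changed:
            break                           # stable labelling reached
    return labels

The number of clusters is not fixed in advance: it emerges from the stable labelling, which is why the method suits the index-grouping task above.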

4. Semantic clusters of very short narrow-domain texts

Struggling for precision: using a more adequate measure

cosine measure: C_{1,2} = Σ_k x_{k,1} · x_{k,2} / (||x_1|| · ||x_2||),
where 1, 2 are the numbers of the texts and x_{k,i} are the cluster coordinates

Coordinate transformation: x_{k,i} = log(1 + f_{k,i}) / log(1 + max(f_i))

Aim: smoothing of the high frequencies of typical abstract words (e.g. method, experiment, result, etc.)
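A sketch of this smoothed cosine, assuming max(f_i) is taken over the coordinates of the text at hand:

# Log-scaled coordinates damp the high frequencies of typical abstract
# words ("method", "experiment", "result", ...) before the cosine.
import math

def smoothed_coordinates(freqs):
    top = math.log(1 + max(freqs))
    return [math.log(1 + f) / top for f in freqs]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine(smoothed_coordinates([10, 1, 0]), smoothed_coordinates([8, 2, 1])))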

4. Semantic clusters of very short narrow-domain texts

Clustering indexes
MajorClust method: the number of clusters is defined automatically

Clustering abstracts
• NN method (hierarchy-based)
• K-means method (example-based)
• MajorClust method (density-based)

Abstracts (preliminary results) improved using:
• the compensative effect
• the logarithmic measure

4. Semantic clusters of very short narrow-domain texts

Experiments: clustering the CICLing-2002 abstracts

Indexing: 390 keywords

Gold standard: 4 clusters (obtained also with MajorClust):
Linguistics (semantics, syntax, morphology, parsing)
Ambiguity (word sense disambiguation, anaphora, tagging, spelling)
Lexicon (lexicon and corpus, text generation)
Text processing (information retrieval, summarization, text classification)

Narrow domain: e.g. V2 ∩ V4 = 70%

Indexing    log scaling    F-measure
TF-IDF      no             0.64
TF          no             0.57
Grouping    yes            0.78
Grouping    no             0.68

4. Semantic clusters of very short narrow-domain texts

F-measure = 2·P·R / (P + R)

Digital libraries and Internet repositories should provide open access both to abstracts and to document images of full papers: this does not violate the copyright of authors!
(proposal by Dr. Makagonov, Mixteca University of Technology, Mexico)

4. Semantic clusters of very short narrow-domain texts

Selection of groups of indexes

4. Use the Transition Point (TP) technique
• based on Zipf's law of word frequency: mid-frequency terms are closely related to the conceptual content of a document
– I1: # of words with frequency equal to 1
– for very short texts: TP = the lowest frequency that is not repeated
• mid-frequency terms + mutual-information terms (over the same collection)

(Pinto, Jiménez, Rosso, 05)
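A sketch of TP term selection; deriving TP from I1 with the closed form used in the TP literature is an assumption here, since the slide only names I1:

# Transition Point sketch: keep mid-frequency terms around TP.
import math
from collections import Counter

def transition_point(tokens):
    freqs = Counter(tokens)
    i1 = sum(1 for f in freqs.values() if f == 1)   # words of frequency 1
    return (math.sqrt(8 * i1 + 1) - 1) / 2

def tp_terms(tokens, threshold=0.4):
    """Keep terms with frequency in [(1 - u)*TP, (1 + u)*TP]."""
    tp = transition_point(tokens)
    freqs = Counter(tokens)
    lo, hi = (1 - threshold) * tp, (1 + threshold) * tp
    return {t for t, f in freqs.items() if lo <= f <= hi}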

4. Semantic clusters of very short narrow-domain texts

Experiments over corpora:
• CICLing-02
• Hep-ex CERN (High Energy Physics, narrow-domain)
• Medline (cancer sub-corpus)

4. Semantic clusters of very short narrow-domain texts

[Figure: ranked word frequency list (vocabulary 15, ambiguous 13, retrieval 11, language 10, cross 8, information 7, James 1); the selected terms are those whose frequency falls between (1 − Threshold)·TP and (1 + Threshold)·TP around the Transition Point (TP)]

Term selection methods
• document frequency: to a term t is assigned the weight df_t (# of texts where t occurs)
• term strength: s(t) = Pr(t ∈ T_i | t ∈ T_j) with i ≠ j, where T_i, T_j are texts such that sim(T_i, T_j) ≥ β
• transition point: idtp(t, T) = 1 / (|TP_v − freq(t, T)| + 1), where TP_v is the Transition Point value and freq(t, T) is the frequency of the term t in the text T

4. Semantic clusters of very short narrow-domain texts

Experimental results (CICLing corpus)
[Chart: clustering quality vs. vocabulary size, without enrichment and with enrichment]

4. Semantic clusters of very short narrow-domain texts

Experimental results (Hep-ex corpus)

4. Semantic clusters of very short narrow-domain texts

DF: Document Frequency
TS: Term Strength
TP: Transition Point
TPMI: Transition Point enriched with a thesaurus (Mutual Information)

Selection of groups of indexes "under construction"

a. use the TP technique with terms + sense-tagging (with WordNet synsets) of the selected terms:
terms vs. synsets vs. terms + synsets

b. use the TP technique with synsets (sense-tag all terms first):
terms (of a.) vs. synsets vs. terms + synsets

4. Semantic clusters of very short narrow-domain texts

Selection of groups of indexes
5. Conceptual clusters (using WordNet synsets)

• composed of two or more document nouns connected by one or more relations (word-count identity and WordNet synonymy, hypernymy, meronymy); see 3. IR with conceptual clusters
• Experiments (CICLing-2002 corpus): F-measure 0.44 (poor results)
• Further work: to investigate other lexical relations + their weights

4. Semantic clusters of very short narrow-domain texts

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic information
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

[Pipeline diagram: a polysemic word w and the synonyms for its senses, taken from WordNet synsets, are sent as queries to a web search engine; the returned snippets form per-sense corpora, over which lexical pattern selection is performed]

5. Lexical pattern extraction: mining the web with semantic info

1. Strength of the lexical pattern P: S_P = (f_P − f̄) / σ
f_P: frequency of P in the sense corpus
f̄: average frequency of all lexical patterns in the corpus
σ: standard deviation

2. Internal dispersion of the lexical pattern P:
Does P occur in the context of all the synonyms of a sense of w? Sense relevant!

3. External dispersion of the lexical pattern P:
Does P occur in just one sense corpus of w? Sense relevant!
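The strength measure is a z-score of the pattern's frequency, e.g.:

# Pattern strength: S_P = (f_P - mean) / stdev over all pattern frequencies.
import statistics

def pattern_strength(freq_p, all_freqs):
    sigma = statistics.pstdev(all_freqs)
    return (freq_p - statistics.mean(all_freqs)) / sigma if sigma else 0.0

print(pattern_strength(12, [12, 3, 2, 2, 1]))  # strongly above average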

5. Lexical pattern extraction: mining the web with semantic info

Bootstrapping technique for corpora construction:
• Categorization task (natural disaster corpus)
• Named Entity Recognition
• Word Sense Disambiguation

(Guzmán, Montes, Rosso, 05)

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: meta ontology or machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

• Text REtrieval Conference (TREC) collection: trec.nist.gov/
• Cross-Language Evaluation Forum (CLEF) competition: www.clef-campaign.org
main CLEF-05 tasks:
– Mono-, Bi- and Multilingual Document Retrieval on News Collections (Ad-Hoc)
– Mono- and Cross-Language Information Retrieval on Structured Scientific Data (Domain-Specific)
– Interactive Cross-Language Information Retrieval (iCLEF)
– Multilingual Question Answering (QA@CLEF)
– Cross-Language Retrieval in Image Collections (ImageCLEF)
– Cross-Language Spoken Document Retrieval (CL-SR)
– Multilingual Web Track (WebCLEF)
– Cross-Language Geographical Retrieval (GeoCLEF)

6. An ontology for QA and vice versa: OntotripleQA

e.g. CLEF-05 question: "Who is Silvio Berlusconi?"

Answers:
Italian Prime Minister (not valid anymore...)
Italian Premier (not valid anymore...)
Business Tycoon (still valid)
Italy's richest person (still valid)
Leader of Forza Italia (still valid)
Mediaset's managing director (still valid)
Milan's president (still valid)

... other possible answers could be added, even if occurring with less redundancy in the document collection (or the Web)...

6. An ontology for QA and vice versa: OntotripleQA

3-level architecture:
• Question Classification/Analysis
• Block/passage/sentence-based Search Engine
• Answer Extraction

UPV QUASAR system: a Multilingual Question Answering System based on a Language-Independent Passage Retrieval Engine

6. An ontology for QA and vice versa: OntotripleQA
[Figure: QUASAR Question Answering system architecture]

Question Analysis

Named Entity Recognition (NER) in the question and in the documents
• to understand the role of an entity: minimum context info extraction
• conceptual representation of documents (entities + contexts) with an ontology

6. An ontology for QA and vice versa: OntotripleQA

QUASAR's Question Analysis
• mainly pattern matching-based
– JAVA Regular Expressions (java.util.regex.*)
– patterns are stored in an XML file
– 4 languages: Italian, Spanish, English, French
• questions are assigned the class related to the matching pattern (if a question matches more than one pattern, the class related to the longest one is assigned)
• 3-level ontology of 18 hierarchical classes (each class corresponds to a strategy and/or set of patterns used in the Question Analysis module)

6. An ontology for QA and vice versa: OntotripleQA

Question Analysis module
• 18 hierarchically grouped classes
– roots: NAME, DEFINITION, DATE, QUANTITY
• Hybrid classification:
– combination of SVM + regular expressions for Spanish and English
– regular expressions for French and Italian
• Extraction of "pivot" words
– based on rules
e.g. How many inhabitants were in Sweden in 1994?

6. An ontology for QA and vice versa: OntotripleQA

L1          L2            L3
NAME        ACRONYM
            PERSON
            TITLE
            LOCATION      COUNTRY
                          CITY
                          GEOGRAPHICAL
DEFINITION
DATE        DAY
            MONTH
            YEAR
            WEEKDAY
QUANTITY    MONEY
            DIMENSION
            AGE

QUASAR's multilingual question classification ontology

6. An ontology for QA and vice versa: OntotripleQA

QUASAR's ontology: pattern file sample (it)

<pattern class="DATE">
  <ptrtext>Quando .+</ptrtext>
  <pattern class="YEAR">
    <ptrtext>(?i).*(che|quale) anno .+</ptrtext>
  </pattern>
  <pattern class="MONTH">
    <ptrtext>(?i).*(che|quale) mese .+</ptrtext>
  </pattern>
  <pattern class="DAY">
    <ptrtext>(?i).*(che|quale) data .+</ptrtext>
    <ptrtext>(?i).*(che|quale) giorno .+</ptrtext>
  </pattern>
  <pattern class="WEEKDAY">
    <ptrtext>(?i).*(che|quale) giorno della settimana .+</ptrtext>
  </pattern>
</pattern>

6. An ontology for QA and vice versa: OntotripleQA

Passage Retrieval module: language-independent JIRS (Java Information Retrieval System) http://jirs.dsic.upv.es

e.g. ¿Quién es la viuda de John Lennon?
(Who is the widow of John Lennon?)

Query N-grams:
1x6-gram: es la viuda de John Lennon
2x5-gram: es la viuda de John, la viuda de John Lennon
3x4-gram: es la viuda de, la viuda de John, viuda de John Lennon
4x3-gram: es la viuda, la viuda de, viuda de John, de John Lennon
5x2-gram: es la, la viuda, viuda de, de John, John Lennon
6x1-gram: es, la, viuda, de, John, Lennon
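Enumerating the query n-grams of the table is straightforward, e.g.:

# All n-grams of a query, from the full n-gram down to unigrams.
def query_ngrams(terms):
    n = len(terms)
    return {size: [" ".join(terms[i:i + size]) for i in range(n - size + 1)]
            for size in range(n, 0, -1)}

grams = query_ngrams(["es", "la", "viuda", "de", "John", "Lennon"])
print(grams[5])  # ['es la viuda de John', 'la viuda de John Lennon']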

6. An ontology for QA and vice versa: OntotripleQA
Passage N-Grams

Both passages contain the most relevant words (John and Lennon), but passage 1 has one 5-gram whereas passage 2 has two 2-grams

Passage 1: ... 4, 2004.- La viuda de John Lennon, Yoko Ono, se ha movilizado ... .
matched n-gram: La viuda de John Lennon (1 x 5-gram)

Passage 2: ... inicio de la carrera de John fuera de los Beatles ... musicales que incluían a John Lennon, Eric Clapton, Keith Richards ... .
matched n-grams: John Lennon, de la (2 x 2-gram)

6. An ontology for QA and vice versa: OntotripleQA

Term weight: w_k = 1 − log(n_k) / (1 + log(N))

n_k: # of passages in which t_k occurs
N: the total # of passages
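For instance, with N = 1000 passages a term occurring in n_k = 10 of them gets w_k = 1 − log(10)/(1 + log(1000)) ≈ 0.71:

# Term weight as reconstructed above.
import math

def term_weight(n_k, n_passages):
    return 1 - math.log(n_k) / (1 + math.log(n_passages))

print(term_weight(10, 1000))  # ~0.709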

6. An ontology for QA and vice versa: OntotripleQA

Density Distance N-Gram Model

h(x, P) = Σ_k w_k if the n-gram x appears in the passage P, 0 otherwise
(the sum runs over the terms t_k of x; x_max is the heaviest matched n-gram)

d(x, x_max) = 1 + ln(1 + L), where L (Length) is the # of words between the n-grams x and x_max

The similarity of a passage p to a question q sums h(x, p) / d(x, x_max) over the matched n-grams and normalizes by the total weight of the question terms.

Passage 1: ... 4, 2004.- La viuda de John Lennon, Yoko Ono, se ha movilizado ... .
Passage 2: ... inicio de la carrera de John fuera de los Beatles ... musicales que incluían a John Lennon, Eric Clapton, Keith Richards ... .

6. An ontology for QA and vice versa: OntotripleQA
Density Distance N-Gram Model

Worked example: ¿Quién es la viuda de John Lennon?
term weights: es(0.061), la(0.061), viuda(0.389), de(0.061), John(0.359), Lennon(0.395); total weight of "es la viuda de John Lennon" = 1.326

Passage 1: ... 4, 2004.- La viuda de John Lennon, Yoko Ono, se ha movilizado ... .
matched: La viuda de John Lennon => h = 1.265; score = 1.265 / 1.326 = 0.954

Passage 2: ... inicio de la carrera de John fuera de los Beatles ... musicales que incluían a John Lennon, Eric Clapton, Keith Richards ... .
matched: John Lennon => 0.754; de la => 0.122 x 0.287 (distance factor 1 / d(x, x_max)); h = 0.789; score = 0.789 / 1.326 = 0.595

6. An ontology for QA and vice versa: OntotripleQA

Answer Extraction

Two philosophies:
• direct search of the answer, and later, search of the justification (QRISTAL, U. Amsterdam)
• direct search of the passage, with pattern matching rules to obtain the answer (QUASAR, INAOE, DFKI)

6. An ontology for QA and vice versa: OntotripleQA
Answer Extraction module

QUASAR's Answer Extraction

Three sub-modules:
Textcrawler: selects the patterns that can match the right answer (candidates) in each of the passages and assigns a weight depending on the distance from the "pivot" words
Filter: rejects patterns that cannot match the right answer (if needed, it asks the Textcrawler for another candidate: very useful for "location" questions)
Selector: selects, from the candidates of every passage, the best one

6. An ontology for QA and vice versa: OntotripleQA

Selection strategies

SV (Simple Voting): the most voted candidate is selected (NOUN questions)
WV (Weighted Voting): the candidate with the heaviest votes is selected (TIME, QUANTITY, DEFINITION and, in case of ambiguity, for DV)
MW (Maximum Weight): the candidate with the maximum weight is selected (used only if there is ambiguity using WV)
DV (Double Voting): exactly like SV, but taking into account also the second-best candidates of each passage (in case of ambiguity with SV)
TOP: the candidate returned by the best passage is selected (only to compute the CWS)

6. An ontology for QA and vice versa: OntotripleQA

[Result figures: French, Italian, Spanish]

6. An ontology for QA and vice versa: OntotripleQA

Confidence Weighting Score (CWS)

In QUASAR the CWS is computed using the following variables:
– concordance: the number of selection strategies giving the same answer divided by the total number of strategies
– rarity: the number of returned passages divided by the maximum number of passages that the JIRS PR module can return
– passage weight: the weight of the selected passage

For NIL answers, the CWS is: 1 − rarity if there are returned passages, 0 otherwise

6. An ontology for QA and vice versa: OntotripleQA

The key role of ontology in QA:

• best QA systems:
– fr-fr: Synapse (64%; 2nd best: 35%...)
– pt-pt: Priberam (64.5%; 2nd best: 25%...)
• both QA systems are based on the QRISTAL multilingual ontology (developed by the private research centre Synapse Development, Toulouse)
• the Priberam QA system also performs question expansion with synonyms (of the question's "head")

6. An ontology for QA and vice versa: OntotripleQA

• "diversified" search engine:
– index of definitions (INAOE, DFKI)
– index of answer types (QRISTAL)
• Classification:
– used by QRISTAL, PRIBERAM and QUASAR
– (mostly) based on patterns (regular expressions)
– more classes, better results?
• 86 classes for PRIBERAM and QRISTAL
• DFKI: 5 classes, INAOE: 3, QUASAR: 4
– deep analysis (syntax + semantics) of the question does not bring significant improvements
• use of "pivot" words

6. An ontology for QA and vice versa: OntotripleQA

Español -> Español
[Chart: accuracy of the best systems at the Spanish CLEF-2005 QA track (tova051eses, mira051eses, inao051eses, upv051eses, alia051eses, aliv051eses, talp051eses); scores range from 42.00% down to 25.50%]

QA JIRS PR-based best systems @ Spanish CLEF-2005 track http://clef-qa.itc.it/2005/

6. An ontology for QA and vice versa: OntotripleQA

[Chart: coverage (0.10-0.55) vs. number of passages (1-20), Yahoo vs. Yahoo-JIRS]

Web-based JIRS Passage Retrieval: Yahoo vs. Yahoo-JIRS (CLEF-2005 QA Spanish track)

OntotripleQA: QA for ontology (Kim et al. 05)

• application of QA techniques to relation extraction
• ontology: (class, relation, class)
e.g. (person, paint, painting)
(person, is_member_of, group)
• ontology instances stored as triples:
e.g. paint(Renoir Pierre-August, "The umbrellas")
• triples can be incomplete (one missing named entity):
e.g. is_member_of(Renoir Pierre-August, ?)
is converted into a natural language question:
e.g. Which group was Renoir Pierre-August a member of?
search on the Web: answer = impressionist
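The triple-to-question step can be sketched with per-relation templates (the template table below is illustrative, not Kim et al.'s actual resource):

# OntotripleQA sketch: an incomplete triple becomes an NL question.
QUESTION_TEMPLATES = {
    "is_member_of": "Which group was {0} a member of?",
    "paint": "Which painting did {0} paint?",
}

def triple_to_question(relation, subj, obj=None):
    if obj is None:                       # one missing named entity
        return QUESTION_TEMPLATES[relation].format(subj)
    return None                           # complete triple: nothing to ask

print(triple_to_question("is_member_of", "Renoir Pierre-August"))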

6. An ontology for QA and vice versa: OntotripleQA

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic information
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: via a meta ontology or a machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

Spanish Railway Service

Goal: designing automatic dialogue systems
Problem: revealing the typical scenarios of dialogue
Data: 100 real dialogues
Difficulties:
• info is fuzzy
• info is absent
• info is in a hidden form

(Alexandrov, Sanchís, Rosso, 05)

DI: Renfe customer service, good morning
US: Good morning
DI: May I help you?
US: Yes, please: I would like to know about a train from Valencia to Barcelona.
DI: What day are you interested in?
US: Next Thursday, in the afternoon.
DI: Let's see. <PAUSE> On Thursday there is a EuroMed leaving at 3 P.M. and arriving in Barcelona at 6.45 P.M.
US: What about the next train?
DI: It leaves at 8 P.M.
US: Too late. Thank you. Bye.

US = User, DI = Directory Inquiry Service. Length: 25% of the dialogues are like this one.

7. Cluster analysis of transcribed spoken dialogues

Spanish Railway Service

Usual solution: manual evaluation of person-to-person dialogues based on lexical analysis

Additional results of the lexical analysis:
Why do citizens of Tarragona like to travel on Sunday?
Why do citizens of Madrid like to ask for discounts?

Example of solution:
hour of departure, discounts
hour of departure, price
return ticket
type of train

7. Cluster analysis of transcribed spoken dialogues

Type of parametersType of parameters Reflecting Reflecting transporttransport service service Reflecting Reflecting passengerpassenger behaviour behaviour

List of parametersList of parameters Town importance 0, 0.25,…1 Town importance 0, 0.25,…1

Urgency Urgency 0, 0.5, 10, 0.5, 1 Return ticket 1/0Return ticket 1/0 Time of departure Time of departure Time of departure (return) Time of departure (return) Wagon-lit 1/0Wagon-lit 1/0 Discounts Discounts 1/01/0 Length of talking Length of talking 0, 0.25,….10, 0.25,….1 Politeness Politeness 0, 0.25, 10, 0.25, 1

… …

DifficultiesDifficulties Information is Information is fuzzyfuzzy Information is Information is absentabsent Information is in a Information is in a hidden formhidden form

Nominal scalesNominal scales Time of departure: Time of departure: Indifference 1/0Indifference 1/0 Morning or day 1/0Morning or day 1/0 Evening or night 1/0Evening or night 1/0 =>=> [(1,0,0) , (0,1,0), (0,0,1)] [(1,0,0) , (0,1,0), (0,0,1)]

PresumptionPresumption For absent parameters For absent parameters it is used: it is used: - a value of - a value of indifferenceindifference - the - the cheapestcheapest and and simplestsimplest alternative alternative
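A minimal sketch (Python) of this parameterization, assuming hypothetical parameter names and a hypothetical sample dialogue; the one-hot nominal scale and the indifference / cheapest-alternative defaults follow the slides:

# Encode one dialogue as a numeric vector (toy subset of the parameters).
DEFAULTS = {
    "urgency": 0.0,                    # presumption: value of indifference
    "return_ticket": 0,                # presumption: cheapest / simplest
    "wagon_lit": 0,
    "discounts": 0,
    "time_of_departure": "indifferent",
}
TIME_ONE_HOT = {                       # nominal scale => one-hot triple
    "indifferent":      (1, 0, 0),
    "morning_or_day":   (0, 1, 0),
    "evening_or_night": (0, 0, 1),
}

def encode(dialogue):
    p = {**DEFAULTS, **dialogue}       # fill in the absent parameters
    return [p["urgency"], p["return_ticket"], p["wagon_lit"],
            p["discounts"], *TIME_ONE_HOT[p["time_of_departure"]]]

# the Valencia-Barcelona dialogue above: a daytime trip, nothing else stated
print(encode({"time_of_departure": "morning_or_day"}))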

7. Cluster analysis of transcribed spoken dialogues

Problems

• influence of dominant parameters => the real structure will be hidden
• influence of noise => the real structure will be distorted

Parameter analysis => filtering parameters. Groups of parameters:
1. Significant value for 90%-95% of objects
2. Significant value for 5%-10% of objects
=> 3. Significant value for roughly 20%-30% of objects

Role of the parameter groups (see the filtering sketch below):
1. First-group parameters are oriented to a uniform object set: eliminated
2. Second-group parameters are oriented to a very granulated object set (in subsystems): eliminated
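A minimal sketch (Python) of this filtering rule for binary parameters; the thresholds and the toy matrix are hypothetical:

# Keep a parameter only if the share of objects with a significant value
# is neither near-universal (group 1) nor very rare (group 2).
def filter_parameters(matrix, low=0.10, high=0.90):
    n = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        share = sum(row[j] for row in matrix) / n   # fraction of objects
        if low < share < high:
            keep.append(j)
    return keep

matrix = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]   # toy 0/1 data
print(filter_parameters(matrix))   # column 0 fires for 100% -> eliminated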

7. Cluster analysis of transcribed spoken dialogues

Objects/attributes matrix. Clustering methods:
• NN method
• K-means method
• MajorClust method
(an illustrative run follows below)
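MajorClust has no standard library implementation, so purely as an illustration here is a k-means run over a toy objects/attributes matrix using scikit-learn; the feature vectors are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([                 # rows = parameterized dialogues (toy data)
    [1.0, 1, 1, 0, 0, 0],
    [1.0, 1, 1, 0, 0, 1],
    [0.0, 0, 0, 0, 1, 0],
    [0.0, 0, 0, 1, 1, 0],
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                  # each cluster ~ one typical dialogue scenario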

7. Cluster analysis of transcribed spoken dialogues

Some conclusions:

• scenarios of dialogues may be determined by clustering them in the space of parameters defined by an expert
• how dialogues are parameterized matters, in order to compensate for the incompleteness and fuzziness of the source information
• the procedure of weighting dialogues and parameters makes it possible to obtain information useful for a user
• the MajorClust method seems to be the best suited to this kind of problem

7. Cluster analysis of transcribed spoken dialogues

Further work

RailwayOnto (under construction), based on the transcribed spoken dialogues:
• hierarchy of questions: by multilevel clustering of questions
• typical chains of questions

for:
• a better cluster analysis
• anticipation of the user's following question(s) (query recommendation)

7. Cluster analysis of transcribed spoken dialogues

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: via a meta ontology or a machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

8. The Clasitext ontology. Applications with ontologies

The importance of analyzing and understanding docs to:
• summarize them
• compare two documents (w.r.t. a given topic)
• answer non-trivial questions

e.g. having read these texts:

Frogs live in water
Benito Juárez is buried in San Fernando cemetery

let us answer these questions:

Do frogs get wet?
Where is the left big toe of Benito Juárez buried?

8. The Clasitext ontology. Applications with ontologies

The Spanish CLASITEX ontology (A. Guzmán, NPI, Mexico): tree-form concepts

e.g. a concept denoting an object:

WEARABLE GARMENT (concept)
    SHOE
        sandal
        moccasin
        boot
    SHIRT
        T-shirt
        long-sleeve shirt
    PANTS

8. The Clasitext ontology. Applications with ontologies

e.g. a process (actions, verbs):

MOVE
    CHANGE OF (X, Y) POSITION (of coordinates on Earth)
        Float
        Swim
        Walk
            run
            jump
        Fly
            soar
    ROTATE (change orientation)
    VIBRATE
    CHANGE OF SIZE
        INFLATE-EXPAND
        CONTRACT-GET SMALLER
    CHANGE OF HEALTH
        GET WELL
        GET SICK
        DIE
    CHANGE OF ECONOMIC STATUS
        GET POOR
        GET RICH

8. The Clasitext ontology. Applications with ontologies

e.g. a relation:

RELATIVE OF
    FATHER OF
    SON OF
        First son of
        Preferred son of
    BROTHER OF

8. The Clasitext ontology. Applications with ontologies

Concept trees (CLASITEX): Spanish, English, French

8. The Clasitext ontology. Applications with ontologies

CLASITEX algorithm (a code sketch of this loop follows below):
1. (BEGIN) Observe the sequence of four words pointed at by the pointer. Do they denote some concept(s)?
   Yes: increase by 1 the counters of each of these concepts. Go to step (5).
   No: go to step (2).
2. Observe the sequence of three words pointed at by the pointer. Do they denote some concept(s)?
   Yes: increase by 1 the counters of each of these concepts. Go to step (5).
   No: go to step (3).
3. Do the same for the sequence of two words. Go to step (5), or else to step (4).
4. Observe the word pointed at. Does it denote some concept(s)?
   Yes: increase by 1 the counters of each of these concepts. Go to step (5).
   No: is it a word denoting no concept, i.e. a meaningless word according to §3.4.1?
      Yes: ignore it. Go to step (5).
      No: print it as "I do not know what this word means, or if it is meaningless" (one has to add these meanings later, by hand, to the CLASITEX tree). Go to step (5).
5. Move the pointer to the right of the word(s) already analyzed and repeat the iteration: go to step (1) (BEGIN). When all the text has been analyzed (the pointer can no longer move to the right), report the most popular topics as the themes of the doc.
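A minimal sketch (Python) of this longest-match loop; the tiny concept dictionary and stopword list are hypothetical stand-ins for the real CLASITEX concept files:

from collections import Counter

CONCEPTS = {                            # word sequence -> concept node
    ("base", "de", "datos"): "computadora",
    ("hoja", "de", "calculo"): "computadora",
    ("juicio",): "acto-juridico",
    ("sentencia",): "acto-juridico",
}
STOPWORDS = {"de", "la", "el", "una"}   # "meaningless" words (cf. §3.4.1)

def clasitex(words):
    counters = Counter()
    i = 0
    while i < len(words):
        for n in (4, 3, 2, 1):                 # steps 1-4: longest match first
            seq = tuple(words[i:i + n])
            if len(seq) == n and seq in CONCEPTS:
                counters[CONCEPTS[seq]] += 1   # bump the concept counters
                i += n                         # step 5: move past the match
                break
        else:
            if words[i] not in STOPWORDS:      # unknown, possibly meaningful
                print("I do not know what this word means:", words[i])
            i += 1
    return counters.most_common()              # most popular topics = themes

print(clasitex("la sentencia menciona una base de datos".split()))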

8. The Clasitext ontology. Applications with ontologies

Concept contribution in a text (CLASITEX): e.g. the "institutions" concept

Concept representativity: the percentage of the text's words related to the concept

8. The Clasitext ontology. Applications with ontologies

In CLASITEX a node of the tree (a concept) is represented by a file containing one-, two-, three- or four-word sequences

e.g. the 3-word sequences for the concept computadora (computer):
arquitectura de computadoras. base de datos. bases de datos. convertidor analogico digital. convertidor digital analogico. convertidores analogico digital. convertidores analogico digitales. convertidores digital analogico. convertidores digital analogicos. editor de texto. editores de texto. estructura de computadoras. hoja de calculo. hojas de calculo. programa de computo. programa de graficacion. programas de computo. sistema de informacion. sistemas de informacion.

e.g. the 2-word sequences for the concept utensilio-de-cocina (kitchen-tool):
bano maria. plato hondo. plato sopero. plato tendido. vaso refresquero.

e.g. the single-word sequences for the concept acto-jurídico (legal act):
apelacion. audiencia. condena. condenada. condenadas. condenado. condenados. culpable. delictuosa. delictuosas. delictuoso. delictuosos. delito. embargo. juicio. procesado. procesados. procesamiento. procesar. proceso. sentencia. sentenciado. sobresellar.

8. The Clasitext ontology. Applications with ontologies

Problem with short articles:
• histograms contain fewer concepts
• less difference between the main and the secondary topics (and noise…)

Stemming or lemmatization (or other techniques to group words: see part 4 of the seminar) becomes necessary

8. The Clasitext ontology. Applications with ontologies

Ontology construction

• For an optimal ontology complexity:
– # of levels
– average # of topics (nodes) in each level (breadth of each level)
– conceptual density? Each part should not be too dense?

• Method of Inductive Auto-Regulation of Models (MIARM): a mathematical model (Ivakhnenko, 1980) to construct an ontology with optimal complexity (as a function of the above parameters)

8. The Clasitext ontology. Applications with ontologies

The level of granularity of an ontology depends on how generic the NLE task is:

1. clustering of texts
2. definition of the main topics in a doc
   • new knowledge discovery
   • noise filtering
3. domain trend analysis

8. The Clasitext ontology. Applications with ontologies

1. clustering of texts

• a similarity measure is needed to calculate the similarity between two docs
• Domain Dictionary (DomD) words: an importance coefficient in (0..1] for the fuzzy relation between word and domain, obtained using:
– an expert opinion
– statistical properties of keywords in docs of the domain: p(wi | Dom) normalized by max{p(wj | Dom)}
– the keyword's abstraction level in a domain-oriented concept ontology: with m levels, a word at level j gets importance coefficient j/m
• just keywords, not all words
(see the sketch below)
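A minimal sketch (Python) of the importance coefficients and of one possible fuzzy similarity between two docs; the slides do not fix the similarity formula, so an importance-weighted keyword overlap is assumed, and the frequencies are hypothetical:

def importance_statistical(freq_in_domain):
    # p(w | Dom) normalized by max p(w' | Dom): coefficients in (0..1]
    max_p = max(freq_in_domain.values())
    return {w: p / max_p for w, p in freq_in_domain.items()}

def importance_by_level(word_levels, m):
    # abstraction-level variant: a word at level j gets coefficient j/m
    return {w: j / m for w, j in word_levels.items()}

def doc_similarity(doc1, doc2, coeff):
    # importance-weighted overlap over keywords only (assumed form)
    kw1 = {w for w in doc1 if w in coeff}
    kw2 = {w for w in doc2 if w in coeff}
    shared = sum(coeff[w] for w in kw1 & kw2)
    total = sum(coeff[w] for w in kw1 | kw2)
    return shared / total if total else 0.0

coeff = importance_statistical({"train": 40, "ticket": 25, "discount": 10})
print(doc_similarity(["train", "ticket"], ["train", "discount"], coeff))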

8. The Clasitext ontology. Applications with ontologies

[Figure: DDom (the Domain Dictionary)]

8. The Clasitext ontology. Applications with ontologies

2. definition of the main topics in a doc (see the previous example about "concept contribution" in a text)

• QA systems: generalization via an ontology if there is no answer
e.g. How many cars are sold in Camogli?
There isn't any info, but x cars are sold in Genova

• doc summarization
in the Mexican Senate: a system developed by Gelbukh (NPI) based on CLASITEX concept trees

8. The Clasitext ontology. Applications with ontologies

2. definition of the main topics in a doc (see the previous example about "concept contribution" in a text)

• new knowledge discovery / extraction:
if in a text there are words related to a certain subject and their conceptual density is greater than a certain threshold
e.g. a text with words about some disease, treatment, hospital: a text about medicine, even without using the word medicine

• noise filtering (see the sketch below):
without the support of one of its lower levels, a topic has to be discarded as a main theme
e.g. "We cannot sell FIAT cars with the weather conditions of Egypt."
In the analyzed text there is no other word related to weather (temperature, rain, sunshine, wind…), so the sentence effectively reduces to:
"We cannot sell FIAT cars in Egypt."
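A minimal sketch (Python) of this noise-filtering rule; the two-topic ontology fragment and the example sentence follow the slides, everything else is hypothetical:

ONTOLOGY = {                  # topic -> some of its lower-level words
    "weather": {"temperature", "rain", "sunshine", "wind"},
    "cars": {"fiat", "engine", "wheel", "dealer"},
}

def supported_topics(candidates, text_words, min_support=1):
    words = {w.lower() for w in text_words}
    return [t for t in candidates
            if len(ONTOLOGY.get(t, set()) & words) >= min_support]

text = "We cannot sell FIAT cars with the weather conditions of Egypt".split()
# "weather" appears but has no lower-level support, so it is discarded:
print(supported_topics(["weather", "cars"], text))   # -> ['cars']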

8. The Clasitext ontology. Applications with ontologies

3. domain trend analysis (scientometrics)

Analysis of a great number of publications in a certain domain per year: some concepts strengthen (some others stop being mentioned at all)

e.g. Medicine

8. The Clasitext ontology. Applications with ontologies

3. domain trend analysis (scientometrics)

• level-0 approximation: if the total load of the area decreases, new sub-areas could be added!
• a finer analysis compares the concept load with the load of its level-0 words (the lowest level): if the concept load does not change but the word load does, the concept has changed its content! (see the sketch below)
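A minimal sketch (Python) of that comparison; the yearly loads are invented numbers, in practice they would come from per-year publication counts:

concept_load = {2004: 120, 2005: 118, 2006: 121}     # stable concept load
word_load = {                                        # its level-0 words
    "x-ray": {2004: 80, 2005: 50, 2006: 20},         # fading
    "mri":   {2004: 40, 2005: 68, 2006: 101},        # rising
}

def content_shift(concept_load, word_load, tol=0.1):
    years = sorted(concept_load)
    first, last = years[0], years[-1]
    concept_stable = (abs(concept_load[last] - concept_load[first])
                      <= tol * concept_load[first])
    words_changed = any(abs(c[last] - c[first]) > tol * c[first]
                        for c in word_load.values())
    return concept_stable and words_changed   # the concept changed content

print(content_shift(concept_load, word_load))        # -> True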

8. The Clasitext ontology. Applications with ontologies

3. domain trend analysis (scientometrics)

• trend with fixed concepts: only a redistribution of the concept load
• DictionaryDomain approach (Makgonov, Mixteca University of Technology, Mexico):
– keywords per year (not fixed along the years)
– a clustering technique to establish a stable set of domain keywords to represent the trend of the domain

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: via a meta ontology or a machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

9. Semantic relatedness of concepts

10. Ontology matching: via a meta ontology or a machine learning approach?

• Suggested Upper Merged Ontology (SUMO), by the IEEE Standard Upper Ontology Working Group (e.g. mapping of WordNet) http://www.ontologyportal.org

• ResearchCyc http://research.cyc.com/
an open-source version of the Cyc technology: the world's largest and most complete general knowledge base and common-sense reasoning engine

• the GLUE system employs machine learning techniques to semi-automatically create semantic mappings (A. Doan et al. 03)

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: via a meta ontology or a machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

11. Biomedical ontologies: semantic relatedness of concepts of the different ontologies

• Genia ontology (Medline corpus)
• GenOntology
• MeSH

11. Biomedical ontologies: semantic relatedness of concepts of the different ontologies

Genia ontology (Medline corpus)

<daml:Class rdf:ID="source"></daml:Class>
<daml:Class rdf:ID="natural"><rdfs:subClassOf rdf:resource="#source"/></daml:Class>
<daml:Class rdf:ID="organism"><rdfs:subClassOf rdf:resource="#natural"/></daml:Class>
<daml:Class rdf:ID="multi_cell"><rdfs:subClassOf rdf:resource="#organism"/></daml:Class>
<daml:Class rdf:ID="mono_cell"><rdfs:subClassOf rdf:resource="#organism"/></daml:Class>
<daml:Class rdf:ID="virus"><rdfs:subClassOf rdf:resource="#organism"/></daml:Class>
<daml:Class rdf:ID="body_part"><rdfs:subClassOf rdf:resource="#natural"/></daml:Class>
<daml:Class rdf:ID="tissue"><rdfs:subClassOf rdf:resource="#natural"/></daml:Class>
<daml:Class rdf:ID="cell_type"><rdfs:subClassOf rdf:resource="#natural"/></daml:Class>
<daml:Class rdf:ID="artificial"><rdfs:subClassOf rdf:resource="#source"/></daml:Class>
<daml:Class rdf:ID="cell_line"><rdfs:subClassOf rdf:resource="#artificial"/></daml:Class>
<daml:Class rdf:ID="other_artificial_source"><rdfs:subClassOf rdf:resource="#artificial"/></daml:Class>
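From a hierarchy like this one, a simple path-based relatedness between classes can be computed. A minimal sketch (Python), with the child-to-parent pairs hard-coded from the snippet above rather than parsed from the DAML file, and the inverse-path-length formula chosen only as one common option:

PARENT = {
    "natural": "source", "artificial": "source",
    "organism": "natural", "body_part": "natural",
    "tissue": "natural", "cell_type": "natural",
    "multi_cell": "organism", "mono_cell": "organism", "virus": "organism",
    "cell_line": "artificial", "other_artificial_source": "artificial",
}

def path_to_root(c):
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def relatedness(c1, c2):
    # 1 / (1 + number of edges between c1 and c2 via their common ancestor)
    p1, p2 = path_to_root(c1), path_to_root(c2)
    lca = next(a for a in p1 if a in p2)   # lowest common ancestor
    return 1.0 / (1 + p1.index(lca) + p2.index(lca))

print(relatedness("virus", "mono_cell"))   # siblings        -> 1/3
print(relatedness("virus", "cell_line"))   # distant classes -> 1/6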

11. Biomedical ontologies: semantic relatedness of concepts of the different ontologies

GenOntology

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
exact_synonym: "mitochondrial inheritance" []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
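A minimal sketch (Python) of reading such [Term] stanzas into dicts, keeping the is_a links that a relatedness measure would walk; the parsing rules are simplified relative to the full OBO format:

def parse_obo_terms(text):
    terms, term = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            term = {"is_a": []}
            terms.append(term)
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "is_a":
                term["is_a"].append(value.split(" ! ")[0])  # keep the GO id
            else:
                term[key] = value
    return terms

sample = """[Term]
id: GO:0000001
name: mitochondrion inheritance
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""
for t in parse_obo_terms(sample):
    print(t["id"], t["name"], t["is_a"])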

11. Biomedical ontologies: semantic relatedness of concepts of the different ontologies

MeSH

<Concept PreferredConceptYN="N">
  <ConceptUI>M0353609</ConceptUI>
  <ConceptName>
    <String>A-23187</String>
  </ConceptName>
  <ConceptUMLSUI>C0878412</ConceptUMLSUI>
  <RegistryNumber>0</RegistryNumber>
  <SemanticTypeList>
    <SemanticType>
      <SemanticTypeUI>T109</SemanticTypeUI>
      <SemanticTypeName>Organic Chemical</SemanticTypeName>
    </SemanticType>
    <SemanticType>
      <SemanticTypeUI>T195</SemanticTypeUI>
      <SemanticTypeName>Antibiotic</SemanticTypeName>
    </SemanticType>
  </SemanticTypeList>
  <ConceptRelationList>
    <ConceptRelation RelationName="NRW">
      <Concept1UI>M0000001</Concept1UI>
      <Concept2UI>M0353609</Concept2UI>
    </ConceptRelation>
  </ConceptRelationList>
  <TermList>
    <Term ConceptPreferredTermYN="Y" IsPermutedTermYN="N" LexicalTag="LAB" PrintFlagYN="N" RecordPreferredTermYN="N">
      <TermUI>T000001</TermUI>
      <String>A-23187</String>
      …

1. The WordNet ontology
2. Conceptual Density for Word Sense Disambiguation
3. Semantic (Geo) Information Retrieval and Text Categorization
4. Semantic clusters of very short narrow-domain texts
5. Lexical pattern extraction: mining the web with semantic info
6. An ontology for Question Answering and vice versa: OntotripleQA
7. Cluster analysis of transcribed spoken dialogues
8. The Clasitext ontology. Applications with ontologies
9. Semantic relatedness of concepts
10. Ontology matching: via a meta ontology or a machine learning approach?
11. The case study of biomedical ontologies: semantic relatedness of concepts of the different ontologies
12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

12. Lab session: CALD, WordNet vs. Genia, GenOntology, MeSH

Thanks

Grazie

Gracias

Grasie (zeneize!)