Ontology-based Information Retrieval
PhD Defense, Henrik Bulskov

Computer Science, building 42.1, Roskilde University
Universitetsvej 1, P.O. Box 260, DK-4000 Roskilde, Denmark
Phone: +45 4674 2000, Fax: +45 4674 3072
www.dat.ruc.dk



Henrik Bulskov, November 3rd, 2006

Research Question

• How do we recognize concepts in information objects and queries, represent these in the information retrieval system, and use the knowledge about relations between concepts captured by ontologies in the querying process?

How to:

• Describe: Recognize and map information in documents and queries into the ontologies,

• Compare: Improve the retrieval process by use of similarity measures derived from knowledge about relations between concepts in ontologies, and

• Retrieve: Introduce ontological indexing and ontological similarity into a realistic information retrieval scenario.


Outline

• Introduction
• Ontologies
• Information Retrieval
• Knowledge Representation
• Ontological Indexing
• Instantiated Ontologies
• Ontological Similarity
• Query Evaluation
• Prototype
• Conclusion and Further Work


Ontologies

• In Philosophy:
  “A science or study of being”, “first philosophy”

• In Knowledge Engineering:
  “A formal, explicit specification of a shared conceptualization”


Information Retrieval Models

• Boolean Model
• Vector Model
• Probabilistic Model
• Fuzzy Retrieval Model
• Fuzzy Sets

classical set: d = {t1, t2, t3}

fuzzy set: d_fuzzy = {1.0/t1, 0.5/t2, 0.1/t3, 0.0/t4}
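The fuzzy-set view of a document can be sketched directly: each term carries a membership degree in [0, 1] instead of a crisp in/out status. The helper names below are illustrative, not from the thesis.

```python
# A document as a fuzzy set: each term maps to a membership degree in [0, 1].
# Term names are illustrative; any hashable term identifiers would work.
d_classical = {"t1", "t2", "t3"}                      # crisp set: member or not
d_fuzzy = {"t1": 1.0, "t2": 0.5, "t3": 0.1, "t4": 0.0}

def membership(d, term):
    """Degree to which `term` describes document `d` (0.0 if absent)."""
    return d.get(term, 0.0)

def fuzzy_intersection(d1, d2):
    """Standard min-based intersection of two fuzzy term sets."""
    return {t: min(membership(d1, t), membership(d2, t))
            for t in d1.keys() | d2.keys()}
```

The min-based intersection is the standard choice in fuzzy set theory; it reappears below in the fuzzy retrieval status value.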


Knowledge Representation

• A knowledge representation is a surrogate. Most of the things that we want to represent cannot be stored in a computer, e.g. bicycles, birthdays, motherhood, etc.

• A knowledge representation is a set of ontological commitments. Representations are imperfect approximations of the world, each attending to some things and ignoring others.

• A knowledge representation is a fragmentary theory of intelligent reasoning. To be able to reason about the things represented, the representation should also describe their behavior and intentions.

• A knowledge representation is a medium for efficient computation. Remarks on useful ways to organize information are given.

“What is a knowledge representation?” [Davis et al., 1993]


A Graphical Representation

• We aim at reasoning by means of a “nearness” principle, where increased nearness entails increased degree of similarity

• Can be based on any formalism that has a suitable network or graphical representation encompassing the semantics expressed


Representation Formalisms

• Semantic Networks
• Frames
• Description Logics
• Lattice Algebra
• OntoLog

A lattice algebra with “attribution” (by means of the Peirce product)


OntoLog - A Lattice-algebraic Approach

• Compound concepts are built from
  • Atomic concepts of the ontology, and
  • Attribution

• Attribute features using semantic relations like
  • WRT: with respect to
  • CHR: characterized by (property ascription)
  • CBY: caused by
  • PNT: patient of act or process
  • LOC: location, position
  • ...

• Concept examples:
  • “The black cat”
    cat[CHR:black]
  • “The noise caused by the black dog”
    noise[CBY:dog[CHR:black]]
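Compound OntoLog-style concepts nest naturally as (atom, attributes) pairs. The small class below is a sketch of that data structure, not the thesis's implementation; the relation names follow the slide.

```python
# Compound OntoLog-style concepts as nested (atom, attributes) structures.
# Relation names (CHR, CBY, ...) follow the slide; the class is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    atom: str
    attrs: tuple = ()          # tuple of (relation, Concept) pairs

    def __str__(self):
        if not self.attrs:
            return self.atom
        inner = ", ".join(f"{r}:{c}" for r, c in self.attrs)
        return f"{self.atom}[{inner}]"

# "The black cat" -> cat[CHR:black]
black_cat = Concept("cat", (("CHR", Concept("black")),))
# "The noise caused by the black dog" -> noise[CBY:dog[CHR:black]]
noise = Concept("noise",
                (("CBY", Concept("dog", (("CHR", Concept("black")),))),))
```

Freezing the dataclass makes concepts hashable, so they can serve directly as index terms in sets and dictionaries.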


Mapping OntoLog Expressions into the Ontology


Ontological Resource

• WordNet: a large lexical database organized in terms of meanings

• Nouns, Adjectives, Adverbs, and Verbs

• Synonymous words are grouped into synsets:
  {car, auto, automobile, machine, motorcar}
  {food, nutrient}
  {police, police force, constabulary, law}

• Number of words, synsets, and senses


Ontological Indexing

• Describing Content
  • Conventional approach: bag of keywords
  • Aim: moving from keywords to concepts

• Description Properties
  • Fidelity: ability to represent the content
  • Exhaustivity: degree of recognized concepts
  • Specificity: generic level of the concepts
  • Level of abstraction: complexity of descriptions


Descriptions Example

“physical well-being caused by a balanced diet”

{{well-being[CHR:physical]}, {diet[CHR:balanced]}}

well-being[CHR:physical, CBY:diet[CHR:balanced]]

{{“physical”, “well-being”}, {“balanced”, “diet”}}

{“physical”, “well-being”, “balanced”, “diet”}


Simple Natural Language Processing

• Part-of-speech tagging

Example: “The black book on the small table”

The/DT black/JJ book/NN on/IN the/DT small/JJ table/NN

• Noun phrase recognition

<PP [NP The/DT black/JJ book/NN] on/IN [NP the/DT small/JJ table/NN]>

• Extracting Descriptions

book[CHR:black, LOC:table[CHR:small]]
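The tagging and chunking steps above can be sketched with a toy lexicon and a hand-rolled NP chunker. A real pipeline would use a trained tagger and chunker; everything below (the lexicon, the DT? JJ* NN+ pattern) is illustrative only.

```python
# Toy part-of-speech lexicon for the running example; a real system would use
# a trained tagger (e.g. an HMM or Brill tagger) instead.
LEXICON = {"the": "DT", "black": "JJ", "small": "JJ",
           "book": "NN", "table": "NN", "on": "IN"}

def pos_tag(sentence):
    """Tag each word via lexicon lookup, e.g. 'book' -> ('book', 'NN')."""
    return [(w, LEXICON[w.lower()]) for w in sentence.split()]

def np_chunk(tagged):
    """Group maximal DT? JJ* NN+ runs into ('NP', tokens) chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NN":
            k += 1
        if k > j:  # at least one noun: emit an NP chunk
            chunks.append(("NP", tagged[i:k]))
            i = k
        else:      # no NP starting here: pass the token through
            chunks.append(tagged[i])
            i += 1
    return chunks
```

On “The black book on the small table” this yields two NP chunks separated by the preposition, mirroring the bracketing on the slide.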


Word Sense Disambiguation

• Guessing (about 45% correct senses)

• Frequencies (increased from 45% to 69%)

• Selectional Restrictions

∀e,x,y (Drinking(e) ∧ Agent(e,x) ∧ Theme(e,y)) ⇒ ISA(y, DrinkableThing)

Example: “He drank gin”

{gin} (strong liquor flavored with juniper berries)
{snare, gin, noose} (a trap for birds or small mammals; … a slip noose)
{cotton gin, gin} (a machine that separates the seeds … cotton fibers)
{gin, gin rummy, knock rummy} (a form of rummy in which a player …)
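The frequency heuristic above amounts to a most-frequent-sense baseline: always pick the sense that occurs most often in a sense-tagged corpus. The counts below are invented for illustration (WordNet orders senses by tagged-corpus frequency, but these numbers are not WordNet's).

```python
# Most-frequent-sense baseline for word sense disambiguation.
# Frequency counts are invented for illustration.
SENSE_FREQ = {
    "gin": [("strong liquor flavored with juniper berries", 12),
            ("a trap for birds or small mammals", 2),
            ("a machine that separates seeds from cotton fibers", 3),
            ("a form of rummy", 1)],
}

def most_frequent_sense(word):
    """Return the gloss of the most frequent sense, or None if unknown."""
    senses = SENSE_FREQ.get(word)
    if not senses:
        return None
    return max(senses, key=lambda s: s[1])[0]
```

For “He drank gin”, this baseline already selects the liquor sense; selectional restrictions would confirm it by requiring the Theme of drinking to be a DrinkableThing.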


Relational Connections

• Noun phrases
  Example: “Physical well-being caused by a balanced diet”
  well-being[CHR:physical, CBY:diet[CHR:balanced]]

• Relations derived from verbs
  The CBY relation between “well-being” and “diet”

• Relations inside simple NPs
  The CHR relation between “diet” and “balanced”


Relational Connections, cont.

• Premodifiers
  Examples: “black book” and “criminal lawyer”
  book[CHR:black] is correct, but lawyer[CHR:criminal] is not.
  Use knowledge from ontologies to differentiate between descriptive and relational adjectives, and use the WRT relation for the latter:
  lawyer[WRT:criminal]

• Prepositions
  “The black book on the small table”
  book[CHR:black, LOC:table[CHR:small]]
  “The meeting on Monday”
  meeting[TMP:Monday]


Compound Descriptions


General Ontologies

• Modeling of concepts in a generative ontology based on different conceptualizations and dictionaries.

• We assume a set of atomic concepts A and a set of semantic relations R

• The set of well-formed terms L of the OntoLog language is recursively defined, with subsumption (≤) satisfying:

if x isa y then x ≤ y
if x[…] ≤ y[…] then also
  x[…, r:z] ≤ y[…], and
  x[…, r:z] ≤ y[…, r:z]
if x ≤ y then also
  z[…, r:x] ≤ z[…, r:y]

Definition: O = (L, ≤, R)
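One operational reading of these subsumption rules: a compound term x[…] is below y[…] when x's atom is below y's atom in the ISA hierarchy and every attribute y requires is matched, recursively, by an attribute of x. The sketch below uses that reading with a tiny illustrative ISA hierarchy; it is not the thesis's implementation.

```python
# Subsumption check over compound terms t = (atom, {relation: subterm}).
# The ISA hierarchy is a toy example for illustration.
ISA = {"poodle": "dog", "dog": "animal", "cat": "animal", "black": "color"}

def atom_leq(x, y):
    """Is atom x equal to, or an ISA-descendant of, atom y?"""
    while x != y:
        if x not in ISA:
            return False
        x = ISA[x]
    return True

def leq(t1, t2):
    """Does t1 <= t2 hold under the rules above?"""
    a1, attrs1 = t1
    a2, attrs2 = t2
    if not atom_leq(a1, a2):
        return False
    # every attribute required by t2 must be matched by one of t1
    return all(r in attrs1 and leq(attrs1[r], sub)
               for r, sub in attrs2.items())
```

For instance, dog[CHR:black] ≤ animal (dropping an attribute generalizes), and poodle[CHR:black] ≤ dog[CHR:black] (specializing the atom preserves subsumption).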


Ontological Similarity

• We aim at reasoning by means of a “nearness” principle, where increased nearness entails increased degree of similarity

• Properties
  • Commonality, Difference, Identity
  • Generalization: sim(animal, poodle) ≠ sim(poodle, animal)
  • Depth
  • Multiple Paths


Ontological Similarity Measures

• Knowledge-based
  • Shortest Path

• Corpus-based
  • Co-occurrences

• Integrated
  • Information Content: probability of encountering an instance, p(c) = freq(c)/N

• Shared Nodes (on instantiated ontologies)


Shared Nodes

• A simplified “all-possible-paths” approach where shared nodes are nodes that are upwards reachable from both concepts, and where similarity is dependent on the number of shared nodes
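A minimal sketch of this idea: α(x) is the set of nodes upwards reachable from x, and similarity grows with the overlap of α(x) and α(y). The Jaccard-style normalization below is one simple choice, not necessarily the exact measure used in the thesis; the ISA hierarchy is illustrative.

```python
# Shared-nodes similarity sketch over a toy ISA hierarchy.
ISA = {"poodle": "dog", "alsatian": "dog", "dog": "animal",
       "cat": "animal", "animal": "thing"}

def alpha(x):
    """Nodes upwards reachable from x, including x itself."""
    nodes = {x}
    while x in ISA:
        x = ISA[x]
        nodes.add(x)
    return nodes

def sim(x, y):
    """Similarity as the fraction of shared upwards-reachable nodes."""
    return len(alpha(x) & alpha(y)) / len(alpha(x) | alpha(y))
```

Two sibling breeds share their whole upward path except the leaves, so sim(poodle, alsatian) exceeds sim(poodle, cat), which only share the nodes from animal upwards.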


Shared Nodes, cont.

• Nodes shared by grey animals


Shared Nodes, cont.

sim(dog[CHR:gray ], cat[CHR:gray ]) > sim(dog[CHR:gray ], dog[CHR:large ])

Counterintuitive:
• concept-inclusion (ISA) should have higher importance than the characterized-by (CHR) property


Weighted Shared Nodes

• Not all nodes are equally important

sim(dog[CHR:gray ], cat[CHR:gray ]) > sim(dog[CHR:gray ], dog[CHR:large ])

• Solution: attach weights in [0,1] to relations, so that the set of nodes upwards reachable from x, α(x), becomes a fuzzy set
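A sketch of the weighted variant: every edge carries a weight in [0,1] (ISA weighted higher than CHR), a node's membership in the fuzzy upward closure is the product of edge weights along the best path, and similarity compares the resulting fuzzy sets. The weights, graph encoding (compound concepts flattened to named nodes), and normalization are all illustrative assumptions.

```python
# Weighted shared nodes: fuzzy upward closure with per-relation edge weights.
# Nodes like "gray_dog" stand in for compound concepts like dog[CHR:gray].
EDGES = {  # node -> list of (parent, weight); ISA edges 0.9, CHR edges 0.3
    "dog":       [("animal", 0.9)],
    "cat":       [("animal", 0.9)],
    "gray_dog":  [("dog", 0.9), ("gray", 0.3)],
    "gray_cat":  [("cat", 0.9), ("gray", 0.3)],
    "large_dog": [("dog", 0.9), ("large", 0.3)],
}

def fuzzy_alpha(x):
    """Fuzzy upward closure: membership = max path product of edge weights."""
    reach = {x: 1.0}
    frontier = [(x, 1.0)]
    while frontier:
        node, m = frontier.pop()
        for parent, w in EDGES.get(node, ()):
            pm = m * w
            if pm > reach.get(parent, 0.0):
                reach[parent] = pm
                frontier.append((parent, pm))
    return reach

def wsim(x, y):
    """Fuzzy overlap of the two closures (min over shared, max over union)."""
    ax, ay = fuzzy_alpha(x), fuzzy_alpha(y)
    shared = sum(min(ax[n], ay[n]) for n in ax.keys() & ay.keys())
    total = sum(max(ax.get(n, 0.0), ay.get(n, 0.0))
                for n in ax.keys() | ay.keys())
    return shared / total
```

With CHR weighted below ISA, dog[CHR:gray] now comes out closer to dog[CHR:large] than to cat[CHR:gray], repairing the counterintuitive ordering of the unweighted measure.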


Query Evaluation - Simple Fuzzy Retrieval

• Assign to each pair (di, cj) a value which defines the relevance of cj to di

• Compute the relevance of documents to a given query as:

d1 = {1/c1 + 1/c2 + 0/c3}
d2 = {1/c1 + 0/c2 + 1/c3}
d3 = {0/c1 + 0/c2 + 1/c3}
Q = {c1, c2}

RSV(d) = ( Σ_{c ∈ Q} min(μ_Q(c), μ_d(c)) ) / |Q|
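Computing the retrieval status value RSV(d) = Σ_{c ∈ Q} min(μ_Q(c), μ_d(c)) / |Q| for the example documents is a few lines; the dictionary encoding is illustrative.

```python
# Fuzzy retrieval status value: average, over the query concepts, of the
# min of query-side and document-side membership.
def rsv(d, Q):
    return sum(min(Q[c], d.get(c, 0.0)) for c in Q) / len(Q)

d1 = {"c1": 1.0, "c2": 1.0, "c3": 0.0}
d2 = {"c1": 1.0, "c2": 0.0, "c3": 1.0}
d3 = {"c1": 0.0, "c2": 0.0, "c3": 1.0}
Q  = {"c1": 1.0, "c2": 1.0}   # crisp query: both concepts fully required
```

On this data, d1 matches both query concepts (RSV 1.0), d2 matches one of two (0.5), and d3 matches none (0.0), which is the ranking the slide's example is built to show.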


Query Evaluation - Hierarchical Aggregation

<q1(d), <q2(d), q3(d), <q4(d), q5(d), q6(d) : M3 : K3> : M2 : K2> : M1 : K1>

[Figure: the corresponding aggregation tree. M1/K1 at the root combines q1(d) with the M2/K2 node, which combines q2(d), q3(d) with the M3/K3 node over q4(d), q5(d), q6(d).]
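The nested structure can be evaluated bottom-up as a tree. The sketch below assumes each M is a plain average of its children and each K an importance weight applied to the aggregated result; both are illustrative stand-ins, not the thesis's exact aggregation operators.

```python
# Hierarchical aggregation sketch: a node is either a leaf score q_i(d)
# (a float) or a pair (children, weight) aggregated bottom-up.
def aggregate(node):
    if isinstance(node, (int, float)):
        return float(node)
    children, weight = node
    scores = [aggregate(c) for c in children]
    return weight * sum(scores) / len(scores)   # M = average, K = weight

# <q1, <q2, q3, <q4, q5, q6 : K3> : K2> : K1> with all q_i(d) = 1.0
tree = ([1.0, ([1.0, 1.0, ([1.0, 1.0, 1.0], 0.5)], 0.8)], 1.0)
```

Evaluating inner nodes first lets sub-queries contribute a single weighted score to their parent, which is what makes the nested query structure composable.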


Prototype – Using WordNet

• Adjectives are not connected by ISA


Conclusion

• Describe
  • Properties of descriptions
  • Shallow NLP analysis
  • Instantiated Ontologies

• Compare
  • Shared Nodes
  • Weighted Shared Nodes

• Retrieve
  • Semantic Expansion
  • Hierarchical Query Evaluation

• Prototype
  • Theoretical foundation