Information Searching

8/3/2019 Information Searching


8/8/2011 1

Information Searching



Information Search

Traditional Search

Web Search

Metadata based Search

Semantic Search



Traditional Search

A collection of documents is a set of documents related to a specific context of interest

The indexing process is applied to the full text of the documents



Classical Search



Web Information Searching: Search Engine Architecture



Web Information Searching

Web Searching & Information Retrieval, IEEE Web Engineering, 2004

Search engines index each web page by representing it as a set of weighted keywords

Using robots or spiders that crawl through the web, search engines pick up useful pages

Indexing of these pages includes:

Removing all frequent or non-significant words (stop words such as "and", "be")

Stemming, which removes derivational suffixes and retains the root (thinking, thinkers, thinks -> think)

Pages found are represented by a set of weighted keywords
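The two indexing steps above can be sketched as follows. This is a minimal illustration, not a production indexer: the stop-word list and the suffix list are assumptions made for the example, and the toy `stem` function is a simple suffix stripper, not the Porter algorithm. Raw counts stand in for keyword weights.

```python
from collections import Counter

# Illustrative stop-word list (assumption); real engines use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "be", "is", "of", "the", "to"}

# Hypothetical suffix list for a toy stemmer, longest suffixes first.
SUFFIXES = ("ers", "ing", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the stemmed keywords of a text, with raw counts as weights."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    kept = [stem(w) for w in words if w and w not in STOP_WORDS]
    return Counter(kept)


keywords = index_terms("Thinkers are thinking and the thinker thinks")
print(keywords)
```

All four variants collapse onto the single keyword `think`, while "are", "and" and "the" are dropped before counting.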



Crawler / Robot / Spider

A crawler is a program, controlled by a crawl module, that browses the web

It collects documents by recursively fetching links from a set of start pages; the retrieved pages (or parts of them) are compressed and stored in a page repository

URLs and their links form a web graph, which can be used by the crawler control module to decide further crawling

To save space, a docID represents each page in the index

The indexer processes the pages collected by the crawler

It decides which pages to index; duplicate documents are discarded

An inverted index is built, which contains for each word a sorted list of couples (such as docID and position in the document)
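A minimal sketch of the crawl-then-index pipeline described above. The in-memory `WEB` dictionary, its URLs and its texts are hypothetical stand-ins for real HTTP fetching, and exact text comparison is only one simple way to detect duplicates.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("web search engines", ["b.example", "c.example"]),
    "b.example": ("search engines index pages", ["a.example"]),
    "c.example": ("web search engines", []),  # same text as a.example
}


def crawl(start_urls):
    """Breadth-first fetch from the start pages; returns URLs in visit order."""
    seen, queue, order = set(start_urls), deque(start_urls), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def build_index(urls):
    """Assign docIDs, skip duplicate texts, build word -> [(docID, position)]."""
    doc_ids, index, seen_texts = {}, {}, set()
    for url in urls:
        text = WEB[url][0]
        if text in seen_texts:
            continue  # duplicate document is discarded
        seen_texts.add(text)
        doc_id = len(doc_ids)  # small integer stands in for the URL
        doc_ids[url] = doc_id
        for position, word in enumerate(text.split()):
            index.setdefault(word, []).append((doc_id, position))
    return doc_ids, index


pages = crawl(["a.example"])
doc_ids, inverted = build_index(pages)
print(inverted["search"])
```

Because docIDs are assigned in increasing order and positions are appended left to right, each postings list comes out already sorted by (docID, position).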



Query Engine

The query engine processes user queries and returns matching answers using a ranking algorithm

The algorithm produces a numerical score expressing the importance of the answer with respect to the query

Utility data structures contain lists of related pages, which can facilitate search

Various query-independent as well as query-dependent data are used to decide ranking (date of modification, site, number of links to other pages, or actual content of documents)

Query-dependent criteria include the cosine measure of similarity in the vector space model
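The cosine measure mentioned above can be sketched as follows, assuming documents and queries are stored as sparse vectors (dictionaries from term to weight). The document names and weights are made up for the illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document names ordered by decreasing cosine score."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


docs = {
    "d1": {"web": 1.0, "search": 1.0},
    "d2": {"music": 1.0},
    "d3": {"search": 2.0},
}
query = {"search": 1.0}
ranking = rank(query, docs)
print(ranking)
```

Note that `d3` outranks `d1` because the cosine depends only on direction: `d3` points exactly along the query, whereas `d1` spreads its weight over a second term.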



TIR (1)

Classical models: Boolean, Vector, Probabilistic

The vector model is the most popular

Documents and queries are represented as vectors



TIR (2)

Weighting algorithm: tf-idf (term frequency - inverse document frequency)

The weight $w_{ij}$ corresponding to the $i$-th component of the vector representation of document $d_j$ is given by

$w_{ij} = tf_{ij} \times idf_i$

where $tf_{ij} = f_{ij} / \max_l(f_{lj})$, $f_{ij}$ is the frequency of term $k_i$ in document $d_j$, and the maximum is computed over all terms mentioned in the document $d_j$

$idf_i$, the inverse document frequency for term $k_i$, is given by $idf_i = \log(N / n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents in which $k_i$ appears

The relevance ranking is $\mathrm{Sim}(d_j, q) = \cos\theta = (d_j \cdot q) / (|d_j| \, |q|)$

Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight

The model rests on the assumption that the index terms are independent
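A short sketch of the tf-idf weighting, using the maximum-normalised term frequency and the logarithmic inverse document frequency. The three toy documents are invented, the natural logarithm is an assumption (the slide does not fix the base), and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one weight dict per document."""
    n_docs = len(docs)  # N
    # n_i: number of documents in which term k_i appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)            # f_ij
        max_freq = max(freq.values())  # max over l of f_lj
        weights.append({
            term: (f / max_freq) * math.log(n_docs / doc_freq[term])
            for term, f in freq.items()
        })
    return weights


docs = [["web", "search", "web"], ["web", "crawler"], ["semantic", "search"]]
w = tf_idf(docs)
print(w[1])
```

In the second document "crawler" (present in one document out of three) receives a higher weight than "web" (present in two), which is the behaviour the slide describes for uncommon versus frequent terms.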



Web Searching (2)

The weighting procedure considers:

If a term appears more frequently than other terms, its associated weight can be increased

If a term appears within many pages, its weight is decreased (it may not be useful in discriminating items)

Usually greater weights are assigned to short pages than to longer ones

The inverted file is updated such that, for each keyword, the system can find a list of all web pages (with associated weights) indexed under this term

The degree of similarity can be calculated using this data
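One way to compute a similarity score directly from such an inverted file is to walk the postings of each query keyword and accumulate the weights per page. The sketch below assumes every query term has weight 1 and skips length normalisation; the postings and their weights are hypothetical.

```python
from collections import defaultdict

# Hypothetical inverted file: keyword -> list of (page, weight) postings.
inverted_file = {
    "semantic": [("p1", 0.9), ("p3", 0.4)],
    "search": [("p1", 0.2), ("p2", 0.7), ("p3", 0.5)],
}


def score_pages(query_terms, inverted_file):
    """Sum, per page, the weights of the query terms it is indexed under."""
    scores = defaultdict(float)
    for term in query_terms:
        for page, weight in inverted_file.get(term, []):
            scores[page] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = score_pages(["semantic", "search"], inverted_file)
print(results)
```

Only pages that appear in at least one postings list are ever touched, which is why the inverted file makes this computation cheap compared with scanning every page.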



Web Searching (3)

To improve search:

Giving more credit to words appearing in the title field

Considering the distance between search keywords appearing within a page

Using different models for assigning weights: probabilistic or language based
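The first two improvements can be sketched as follows. The boost factor `TITLE_BOOST` is a hypothetical value chosen for the example, and the page is assumed to be already tokenised into a title list and a body list.

```python
TITLE_BOOST = 2.0  # hypothetical factor; a real engine would tune this


def term_score(term, title, body):
    """Count occurrences of a term, weighting title hits by TITLE_BOOST."""
    return TITLE_BOOST * title.count(term) + body.count(term)


def min_distance(term_a, term_b, body):
    """Smallest gap in word positions between two terms, or None if one is absent."""
    pos_a = [i for i, w in enumerate(body) if w == term_a]
    pos_b = [i for i, w in enumerate(body) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


title = ["semantic", "search"]
body = ["search", "engines", "support", "semantic", "web", "search"]

print(term_score("search", title, body))
print(min_distance("semantic", "search", body))
```

A ranking function could then favour pages where the query keywords have a small minimum distance, on the assumption that nearby keywords are more likely to be used together.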



TIR (3)

Problems with TIR

Keyword based search

Measure of relevancy of retrieved document



Semantic Metadata

Data which may be associated explicitly or implicitly with a given piece of content and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge

Helps in classification and high-precision searching

Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology

Metadata is a snapshot of the document's relevant information

Metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology
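A toy illustration of a metadata snapshot that references ontology instances. Plain string lookup stands in for a real named entity recogniser here, and the ontology contents, instance identifiers and sample sentence are all hypothetical.

```python
# Hypothetical ontology: instance id -> (surface name, entity type).
ONTOLOGY = {
    "person/1": ("Tim Berners-Lee", "person"),
    "place/1": ("Geneva", "place"),
    "event/1": ("World Wide Web Conference", "event"),
}


def extract_metadata(text):
    """Return the ids of ontology instances whose name occurs in the text."""
    return sorted(
        instance_id
        for instance_id, (name, _entity_type) in ONTOLOGY.items()
        if name in text
    )


snapshot = extract_metadata("Tim Berners-Lee worked near Geneva.")
print(snapshot)
```

The snapshot stores identifiers rather than strings, so two documents that mention the same entity point at the same ontology instance.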



Relevant Information



Types of Specs and Standards (or MetaModels)

Domain Independent: MCF (Meta Content Framework), RDF, MOF (Meta Object Facility), Dublin Core

Media Specific: MPEG-4, MPEG-7, VoiceXML

Domain/Industry Specific (metamodels): MARC, DCMI, METS (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing Requirements for Industry Standard Metadata)

Application Specific: ICE, Information & Content Exchange (communication between sender and receiver)

Exchange/Sharing: XCM, XMI

Other Models: RDFS, namespaces, ontologies (DAML, OIL)



Dublin Core Metadata Initiative (DCMI), 1995-96

Simple element set designed for domain-independent resource description

15 elements are defined by this standard

International, inter-discipline, W3C community consensus

Semantic interface among resource description communities (very limited form of semantics)



DCMI (2002): http://dublincore.org/documents/usageguide/elements.shtml

Title: name given to the resource

Contributor: entity responsible for making contributions to the content of the resource

Creator: entity primarily responsible for making the content of the resource

Publisher: an entity responsible for making the resource available

Subject & keywords: topic of the content of the resource

Description: an account of the content of the resource

Format: data representation of the resource

Resource identifier: unambiguous reference

Language

Rights: copyright notice/statement

Date, Type, Source, Relation, Coverage
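As an illustration, a handful of these elements can be serialised as XML with Python's standard library. The wrapper element name `metadata` and the field values are assumptions made for the example; only the namespace URI is the one defined for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a wrapper element with one dc:* child per (element, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an arbitrary choice
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Illustrative values only (assumptions, not taken from a real catalogue).
record = dublin_core_record({
    "title": "Information Searching",
    "subject": "information retrieval; semantic search",
    "language": "en",
    "format": "text/plain",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the `dc` prefix is registered up front, the serialised children appear as `dc:title`, `dc:subject` and so on, which is the form most Dublin Core consumers expect.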



DDI Data Documentation Initiative

DDI Data Documentation Initiative:

Technical documentation of social, behavioural, and economic data

SDMX Statistical Data and Metadata Exchange: used by statisticians for exchange of time series data



Creating and Serving Metadata to Power the Life-cycle of Content

[Figure: stages of the content life-cycle and the question metadata answers at each stage]

Produce / Aggregate: Where is the content? Whose is it?

Catalog / Index: What is this content about?

Integrate / Syndicate: What other content is it related to?

Personalize: What is the right content for this user?

Interactive Marketing: What is the best way to monetize this interaction?

Other figure labels: Broadcast, Wireline, Wireless, Interactive TV; Taalee Semantic MetaBase; Taalee Content Applications; Taalee Infrastructure Services



Intelligent Search



Intelligent Search using Ontologies

[Figure labels: User, Query, Mediator 1: Ontology 1, Ontology, Answer]



SWIR

Use of the Vector Space Model for the Semantic Web: at the semantic level, documents can be represented as vectors in a hyperspace defined by the set of all ontology concepts

The weight of a concept is the relative importance of that concept

SWIR needs:

A good domain ontology

An understanding of the semantic relationships among ontological concepts



SWIR (3)

Weights are assigned to links based on certain properties of the ontology, representing the strength of the relation

The spread activation technique is used to find related concepts in the ontology, given some initial set of concepts and initial weights



SWIR (4) : Weighting Algorithm

As with the tf-idf strategy in traditional IR, two measures are combined: the first gives the degree of similarity between two related concept instances in a relation, and the second gives the specificity of the concept relation

The cluster measure for concept instances $C_j$ and $C_k$ is given by:

$W(C_j, C_k) = \sum_i n_{ijk} / \sum_i n_{ij}$

where $n_{ij}$ is 1 if concepts $C_j$ and $C_i$ are related (0 otherwise) and $n_{ijk}$ is 1 if both $C_j$ and $C_k$ are related to concept $C_i$ (0 otherwise). Therefore $W(C_j, C_k)$ represents the fraction of the concepts related to $C_j$ that are also related to $C_k$

This measure reflects the fact that concepts sharing common relations are semantically similar
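A small sketch of the cluster measure, reading it as the share of the concepts related to the first concept that are also related to the second. The relation graph below is hypothetical, and relations are stored simply as sets of neighbouring concepts.

```python
# Hypothetical relation graph: concept -> set of concepts it is related to.
related = {
    "C_j": {"A", "B", "C", "D"},
    "C_k": {"B", "C", "E"},
}


def cluster_measure(c_j, c_k, related):
    """Fraction of the concepts related to c_j that are also related to c_k."""
    neighbours_j = related.get(c_j, set())
    if not neighbours_j:
        return 0.0  # no relations at all: nothing can be shared
    shared = neighbours_j & related.get(c_k, set())
    return len(shared) / len(neighbours_j)


print(cluster_measure("C_j", "C_k", related))
print(cluster_measure("C_k", "C_j", related))
```

The measure is not symmetric: the two concepts share the same two neighbours, but those make up half of the first concept's relations and two thirds of the second's.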



SWIR (5)

The specificity measure is given by:

$W(C_j, C_k) = 1 / n_k$

where $n_k$ is the number of instances of the given relation type that have $k$ as their destination node

The actual measure is the product of the cluster and specificity measures
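The specificity measure and the final product can be sketched as below, taking the cluster measure as an already computed input. The triples and relation names are hypothetical, and the formula follows the slide as written (one over the count of incoming relation instances).

```python
# Hypothetical relation instances as (source, relation type, destination) triples.
triples = [
    ("paper1", "cites", "paper9"),
    ("paper2", "cites", "paper9"),
    ("paper3", "cites", "paper9"),
    ("paper3", "cites", "paper4"),
    ("paper1", "extends", "paper9"),
]


def specificity(relation, destination, triples):
    """1 / n_k, where n_k counts instances of `relation` ending at `destination`."""
    n_k = sum(1 for _, rel, dst in triples if rel == relation and dst == destination)
    return 1.0 / n_k if n_k else 0.0


def link_weight(cluster, relation, destination, triples):
    """Product of a given cluster measure and the specificity measure."""
    return cluster * specificity(relation, destination, triples)


print(specificity("cites", "paper9", triples))
print(link_weight(0.5, "cites", "paper4", triples))
```

A destination that many nodes point to through the same relation type yields a low specificity, much as a very common term yields a low idf.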



SWIR (6): Spread Activation Algorithm

Given an initial set of concepts, the algorithm obtains a set of closely (semantically) related concepts by navigating through the linked concepts in the graph

The algorithm has as its starting point an initial set of instances in the ontology, each having an initial activation value

Constrained spread activation applies constraints such as maximum path length and fan-out to the propagation
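A minimal sketch of constrained spread activation with the two constraints named above. The concept graph, the link weights and the default limits are hypothetical, and multiplying activation by link weight is just one common propagation rule, not necessarily the one the cited work uses.

```python
# Hypothetical weighted concept graph: node -> {neighbour: link weight}.
GRAPH = {
    "jazz": {"music": 0.8, "saxophone": 0.6},
    "music": {"concert": 0.5},
    "saxophone": {},
    "concert": {"ticket": 0.9},
    "ticket": {},
}


def spread_activation(graph, initial, max_path_length=2, max_fan_out=3):
    """Propagate activation from the initial nodes under two constraints."""
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_path_length):  # path-length constraint
        next_frontier = {}
        for node, value in frontier.items():
            links = graph.get(node, {})
            if len(links) > max_fan_out:
                continue  # fan-out constraint: do not spread from hub nodes
            for neighbour, weight in links.items():
                spread = value * weight
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + spread
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation


result = spread_activation(GRAPH, {"jazz": 1.0})
print(result)
```

With a maximum path length of 2, activation reaches "concert" through "music" but stops before "ticket", which lies three links away from the starting concept.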



Conclusion (1)

Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections

Web information retrieval: huge volumes of data which are volatile, heterogeneous, distributed and multilingual

Semantic web information retrieval is ontology-based intelligent information retrieval

Various semantic search strategies are explored

Two major differences:

Keyword vs. concept

Response time as a part of the relevancy measure

The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which uses classical techniques with the spread activation algorithm



Conclusion (2)

Concepts which form the basis of the semantic domain model are not orthogonal. This issue can be addressed by reassigning the weights to concept links based on the relationship graph of the ontology concepts

The spread activation algorithm has been used to deduce relationships based on a given set of relationships

The SIR has been visualized as a 4-layer process: keywords, indexed keywords, semantic concepts, relationships



References

Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, 2001, 284: 35-43

Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, 1st edition, Addison-Wesley, 1999

Pokorny J., Web Searching and Information Retrieval, Web Engineering, July/August 2004, 43-48