8/3/2019 Information Searching
1/30
8/8/2011 1
Information Searching
Information Search
Traditional Search
Web Search
Metadata based Search
Semantic Search
Traditional Search
A collection is a set of documents related to a specific context of interest
The indexing process is applied to the full text of the documents
Classical Search
Web Information Searching: Search Engine Architecture
Web Information Searching
Web Searching & Information Retrieval, IEEE Web Engineering, 2004
Search engines index each web page by representing it as a set of weighted keywords
Using robots or spiders that crawl through the web, search engines pick up useful pages
Indexing of these pages includes:
Removing all frequent or non-significant words (stop words such as "and", "be")
Stemming, which removes derivational suffixes and retains the root (thinking, thinkers, thinks all reduce to think)
Pages found are represented by a set of weighted keywords
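The two indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the stop-word list, the suffix list, and the names `naive_stem` and `index_terms` are all assumptions made for the example, and real systems use a proper stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (assumption); real engines use much longer ones.
STOP_WORDS = {"a", "an", "and", "be", "is", "of", "the", "to"}

# Suffixes stripped by this toy stemmer, longest first (assumption).
SUFFIXES = ("ers", "ing", "er", "s")


def naive_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = index_terms("Thinkers and thinking: the thinker thinks")
print(terms)
```

All four content words reduce to the same root here, which is the effect the stemming bullet describes.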
Crawler / Robot / Spider
A crawler is a program, controlled by a crawl control module, that browses the web
It collects documents by recursively fetching links from a set of start pages; the retrieved pages (or parts of them) are compressed and stored in a page repository
URLs and their links form a web graph, which the crawler control module can use to decide on further crawling
To save space, a docID represents each page in the index
The indexer processes the pages collected by the crawler
It decides which pages to index; duplicate documents are discarded
An inverted index is built, which contains for each word a sorted list of couples (such as docID and position in the document)
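A rough sketch of the crawl-then-index pipeline described on this slide, under simplifying assumptions: the "web" is a hypothetical in-memory dictionary (`WEB`), duplicates are detected only by exact text match, and compression of the page repository is left out. The function names `crawl` and `build_index` are illustrative.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("web search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("crawlers fetch pages", ["a.example"]),
    "c.example": ("web search engines crawl the web", []),  # same text as a.example
}


def crawl(start_pages):
    """Breadth-first fetch from the start pages; returns URL -> text."""
    repository, frontier, seen = {}, deque(start_pages), set(start_pages)
    while frontier:
        url = frontier.popleft()
        text, links = WEB[url]
        repository[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


def build_index(repository):
    """Assign docIDs, drop duplicate texts, build word -> [(docID, position)]."""
    doc_ids, index, seen_texts = {}, {}, set()
    for url, text in repository.items():
        if text in seen_texts:
            continue  # exact duplicate document is discarded
        seen_texts.add(text)
        doc_id = len(doc_ids)
        doc_ids[doc_id] = url
        for position, word in enumerate(text.split()):
            index.setdefault(word, []).append((doc_id, position))
    return doc_ids, index


doc_ids, index = build_index(crawl(["a.example"]))
print(index["web"])
```

Because docIDs are assigned in increasing order and positions are appended left to right, each posting list comes out already sorted by (docID, position).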
Query Engine
The query engine processes user queries and returns matching answers using a ranking algorithm
The algorithm produces a numerical score expressing the importance of the answer with respect to the query
Utility data structures contain lists of related pages, which can facilitate search
Various query-independent as well as query-dependent data are used to decide ranking (date of modification, site, number of links to other pages, or actual content of documents)
Query-dependent criteria include the cosine measure of similarity in the vector space model
TIR (1)
Classical models: Boolean, Vector, Probabilistic
The vector model is the most popular
Documents and queries are represented as vectors
TIR (2)
Weighting algorithm: tf-idf (term frequency-inverse document frequency)
The weight $w_{ij}$ corresponding to the $i$-th component of the vector representation of document $d_j$ is given by $w_{ij} = tf_{ij} \times idf_i$
Here $tf_{ij} = f_{ij} / \max_l f_{lj}$, where $f_{ij}$ is the frequency of term $k_i$ in $d_j$ and the maximum is computed over all terms mentioned in the document $d_j$
$idf_i$, the inverse document frequency for term $k_i$, is given by $idf_i = \log(N / n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$
The relevance ranking is $Sim(d_j, q) = \cos\theta = (d_j \cdot q) / (|d_j| \, |q|)$
Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight
The model rests on the assumption that the index terms are independent
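The weighting and ranking formulas on this slide can be worked through on a toy collection. The sketch below assumes three made-up documents and a two-term query; the names `tfidf_vector` and `cosine` are illustrative, the natural logarithm is an arbitrary choice of base, and the query is weighted with the same tf-idf scheme as the documents, which is one common convention rather than something the slide specifies.

```python
import math
from collections import Counter

# Toy collection (assumption): document name -> list of index terms.
docs = {
    "d1": ["web", "search", "web", "engine"],
    "d2": ["semantic", "search", "ontology"],
    "d3": ["web", "crawler"],
}
N = len(docs)
# n_i: number of documents that contain term k_i.
doc_freq = Counter(term for terms in docs.values() for term in set(terms))


def tfidf_vector(terms):
    """w_ij = (f_ij / max_l f_lj) * log(N / n_i), as a sparse dict."""
    counts = Counter(t for t in terms if t in doc_freq)
    if not counts:
        return {}
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log(N / doc_freq[t]) for t, f in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = {name: tfidf_vector(terms) for name, terms in docs.items()}
query = tfidf_vector(["semantic", "search"])
ranking = sorted(docs, key=lambda name: cosine(vectors[name], query), reverse=True)
print(ranking)
```

In this collection "semantic" occurs in one document and "search" in two, so "semantic" carries the larger idf and the document containing both query terms ranks first.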
Web Searching (2)
The weighting procedure considers:
If a term appears more frequently than other terms, its associated weight can be increased
If a term appears within many pages, its weight is decreased (it may not be useful in discriminating items)
Usually greater weights are assigned to short pages than to longer ones
The inverted file is updated so that for each keyword, the system can find a list of all web pages (with associated weights) indexed under this term
The degree of similarity can be calculated using this data
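One way to picture the weighted inverted file is as a mapping from keyword to (page, weight) postings, with a query scored by accumulating the weights of its keywords per page. The sketch below uses that simple additive score as a stand-in for the similarity computation; the postings, the weights, and the name `score_pages` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical inverted file: keyword -> list of (page, weight) postings.
inverted_file = {
    "semantic": [("p1", 0.8), ("p3", 0.3)],
    "search": [("p1", 0.2), ("p2", 0.5), ("p3", 0.4)],
}


def score_pages(query_terms):
    """Sum the stored weights of every query keyword found on each page."""
    scores = defaultdict(float)
    for term in query_terms:
        for page, weight in inverted_file.get(term, []):
            scores[page] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = score_pages(["semantic", "search"])
print([page for page, _ in results])
```

Only the posting lists of the query keywords are touched, which is why the inverted file makes this lookup cheap even for a large collection.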
Web Searching (3)
To improve search:
Giving more credit to words appearing in the title field
Considering the distance between search keywords appearing within a page
Using different models for assigning weights: probabilistic or language based
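The first two improvements can be illustrated with a small scoring function. This is only a sketch: the boost factor of 2.0, the reciprocal-distance bonus, and the names `min_distance` and `boosted_score` are assumptions chosen for the example, not values taken from the slides.

```python
def min_distance(tokens, term_a, term_b):
    """Smallest gap in token positions between two terms, or None if one is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def boosted_score(title, body, query_terms, title_boost=2.0):
    """Count matches, weighting title hits more, then add a proximity bonus."""
    title_tokens, body_tokens = title.lower().split(), body.lower().split()
    score = 0.0
    for term in query_terms:
        score += title_boost * title_tokens.count(term) + body_tokens.count(term)
    if len(query_terms) == 2:
        gap = min_distance(body_tokens, *query_terms)
        if gap:
            score += 1.0 / gap  # closer keywords earn a larger bonus
    return score


query = ["semantic", "search"]
close = boosted_score("Semantic search", "semantic web search uses ontology", query)
far = boosted_score("Home", "search tips for the semantic web", query)
print(close, far)
```

The first page scores higher both because the keywords appear in its title and because they sit closer together in its body.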
TIR (3)
Problems with TIR
Keyword based search
Measure of relevancy of retrieved document
Semantic Metadata
Data which may be associated explicitly or implicitly with a given piece of content and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge
Helps in classification and high-precision searching
Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology
Metadata is a snapshot of the document's relevant information
Metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology
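As a rough picture of a metadata snapshot that references ontology instances, the sketch below uses plain dictionary lookup in place of real named entity recognition. The ontology contents, the instance IDs, and the name `extract_metadata` are all hypothetical.

```python
# Hypothetical ontology: instance ID -> (entity type, surface name).
ONTOLOGY = {
    "person/42": ("person", "Tim Berners-Lee"),
    "place/7": ("place", "Geneva"),
    "event/3": ("event", "WWW Conference"),
}


def extract_metadata(doc_id, text):
    """Return a snapshot listing the ontology instances mentioned in the text."""
    mentions = [
        {"instance": instance_id, "type": entity_type, "offset": text.find(name)}
        for instance_id, (entity_type, name) in ONTOLOGY.items()
        if name in text
    ]
    return {"doc": doc_id, "entities": sorted(mentions, key=lambda m: m["offset"])}


snapshot = extract_metadata("doc-1", "Tim Berners-Lee spoke in Geneva.")
print(snapshot)
```

The snapshot stores only references (instance IDs) to the entities; the entities themselves stay in the ontology, as the slide describes.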
Relevant Information
Types of Specs and Standards (or MetaModels)
Domain Independent: MCF (Meta Content Framework), RDF, MOF (Meta-Object Facility), Dublin Core
Media Specific: MPEG4, MPEG7, VoiceXML
Domain/Industry Specific (metamodels): MARC, DCMI, METS (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing Requirements for Industry Standard Metadata)
Application Specific: ICE, Information & Content Exchange (communication between sender and receiver)
Exchange/Sharing: XCM, XMI
Other Models: RDFS, namespaces, ontologies (DAML, OIL)
Dublin Core Metadata Initiative (DCMI), 1995-96
Simple element set designed for domain-independent resource description
15 elements are defined by this standard
International, interdisciplinary, W3C community consensus
Semantic interface among resource description communities (a very limited form of semantics)
DCMI (2002) http://dublincore.org/documents/usageguide/elements.shtml
Title: name given to the resource
Contributor: entity responsible for making contributions to the content of the resource
Creator: entity primarily responsible for making the content of the resource
Publisher: entity responsible for making the resource available
Subject & keywords: topic of the content of the resource
Description: an account of the content of the resource
Format: data representation of the resource
Resource identifier: unambiguous reference to the resource
Language
Rights: copyright notice/statement
Date, Type, Source, Relation, Coverage
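A Dublin Core description is often serialized as XML, with each element placed in the `http://purl.org/dc/elements/1.1/` namespace. The sketch below builds such a record with Python's standard library. The wrapper element name `metadata`, the helper name `dublin_core_record`, and all field values other than the title are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <metadata> element with one dc:* child per (element, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return root


record = dublin_core_record({
    "title": "Information Searching",
    "subject": "information retrieval; semantic search",  # placeholder value
    "language": "en",                                      # placeholder value
    "format": "text/plain",                                # placeholder value
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Every element is optional and repeatable in Dublin Core, so a record with only a few of the 15 elements, as here, is still a valid description.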
DDI Data Documentation Initiative
DDI (Data Documentation Initiative): technical documentation of social, behavioural, and economic data
SDMX (Statistical Data and Metadata Exchange): used by statisticians for exchange of time series data
Creating and Serving Metadata to Power the Life-cycle of Content
[Figure: content life-cycle stages and the question that metadata answers at each stage
Produce / Aggregate: Where is the content? Whose is it?
Catalog / Index: What is this content about?
Integrate / Syndicate: What other content is it related to?
Personalize: What is the right content for this user?
Interactive Marketing: What is the best way to monetize this interaction?
Delivery channels: Broadcast, Wireline, Wireless, Interactive TV
Layers: Taalee Semantic MetaBase, Taalee Content Applications, Taalee Infrastructure Services]
Intelligent Search
Intelligent Search using Ontologies
[Figure: diagram with the labels User, Query, Answer, Ontology, Mediator1: Ontology 1, Mediator2: Ontology 2, Mediator3: Ontology 3]
SWIR
Use of the Vector Space Model for the Semantic Web: documents at the semantic level can be represented as vectors in a hyperspace defined by the set of all ontology concepts
The weight of a concept is the relative importance of that concept
SWIR needs:
A good domain ontology
An understanding of the semantic relationships among ontological concepts
SWIR (3)
Weights are assigned to links based on certain properties of the ontology, representing the strength of the relation
The spread activation technique is used to find related concepts in the ontology, given some initial set of concepts and initial weights
SWIR (4) : Weighting Algorithm
By analogy with the tf-idf strategy in traditional IR, two measures are used: the first gives the degree of similarity between two related concept instances in a relation, and the second gives the specificity of the concept relation
The cluster measure for concept instances $C_j$ and $C_k$ is given by $W(C_j, C_k) = \sum_i n_{ijk} / \sum_i n_{ij}$
Here $n_{ij}$ indicates that concepts $C_j$ and $C_i$ are related, and $n_{ijk}$ indicates that both $C_j$ and $C_k$ are related to concept $C_i$. Therefore $W(C_j, C_k)$ represents the percentage of concepts that $C_j$ is related to that $C_k$ is also related to
This measure reflects the fact that concepts sharing common relations are semantically similar
SWIR (5)
The specificity measure is given by $W(C_j, C_k) = 1 / n_k$
Here $n_k$ is the number of instances of the given relation type that have $k$ as their destination node
The actual measure is the product of the cluster and specificity measures
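The cluster and specificity measures can be tried out on a tiny ontology fragment. The sketch below is one reading of the definitions, under stated assumptions: the cluster measure is taken as a ratio of counts over all concepts $C_i$ (the number related to both $C_j$ and $C_k$, divided by the number related to $C_j$), relations are treated as undirected when deciding whether two concepts are related, and the relation instances and function names are invented for the example.

```python
# Hypothetical ontology fragment: relation instances as (source, type, destination).
RELATIONS = [
    ("alice", "works_with", "bob"),
    ("carol", "works_with", "bob"),
    ("alice", "wrote", "paper1"),
    ("bob", "wrote", "paper1"),
    ("bob", "wrote", "paper2"),
]


def neighbours(concept):
    """Concepts linked to `concept` by any relation, in either direction."""
    out = {d for s, _, d in RELATIONS if s == concept}
    return out | {s for s, _, d in RELATIONS if d == concept}


def cluster(cj, ck):
    """Share of the concepts related to cj that are also related to ck."""
    related_j = neighbours(cj)
    return len(related_j & neighbours(ck)) / len(related_j) if related_j else 0.0


def specificity(relation_type, ck):
    """1 / n_k, with n_k the instances of relation_type whose destination is ck."""
    n_k = sum(1 for _, t, d in RELATIONS if t == relation_type and d == ck)
    return 1.0 / n_k if n_k else 0.0


def link_weight(cj, relation_type, ck):
    """Product of the cluster and specificity measures for the link cj -> ck."""
    return cluster(cj, ck) * specificity(relation_type, ck)


print(link_weight("alice", "works_with", "bob"))
```

Here alice is related to two concepts, one of which (paper1) is also related to bob, so the cluster value is 0.5; two `works_with` instances point at bob, so the specificity is 0.5, and the link weight is their product. Note that the cluster measure is not symmetric: `cluster("bob", "alice")` divides by bob's four related concepts instead.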
SWIR (6): Spread Activation Algorithm
Given an initial set of concepts, the algorithm obtains a set of closely (semantically) related concepts by navigating through the linked concepts in the graph
The algorithm has as its starting point an initial set of instances in the ontology, each with an initial activation value
Constrained spread activation applies constraints such as maximum path length and fan-out to the propagation
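A minimal sketch of constrained spread activation, assuming a small hand-made concept graph with link weights between 0 and 1. The path-length and fan-out constraints come from the slide; the activation threshold, the graph itself, and the name `spread_activation` are additions made for the example.

```python
# Hypothetical weighted concept graph: concept -> {neighbour: link weight}.
GRAPH = {
    "semantic_web": {"ontology": 0.9, "rdf": 0.8},
    "ontology": {"owl": 0.7, "taxonomy": 0.5},
    "rdf": {"sparql": 0.6},
    "owl": {}, "taxonomy": {}, "sparql": {},
}


def spread_activation(initial, max_path_length=2, max_fan_out=2, threshold=0.3):
    """Propagate activation from the initial concepts under simple constraints."""
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_path_length):  # path-length constraint
        next_frontier = {}
        for node, value in frontier.items():
            # Fan-out constraint: follow only the strongest outgoing links.
            links = sorted(GRAPH.get(node, {}).items(), key=lambda kv: -kv[1])
            for neighbour, weight in links[:max_fan_out]:
                incoming = value * weight
                if incoming < threshold:
                    continue  # too weak to keep spreading (assumed constraint)
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + incoming
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation


result = spread_activation({"semantic_web": 1.0})
print(sorted(result, key=result.get, reverse=True))
```

Activation decays as it is multiplied by link weights along a path, so concepts two hops away end up with lower values than direct neighbours, and tightening any constraint shrinks the set of concepts reached.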
Conclusion (1)
Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections
Web information retrieval: huge volumes of data that are volatile, heterogeneous, distributed, and multilingual
Semantic web information retrieval is ontology-based intelligent information retrieval
Various semantic search strategies are explored
Two major differences:
Keyword vs. concept
Response time is a part of the relevancy measure
The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which combines a classical technique with the spread activation algorithm
Conclusion (2)
The concepts which form the basis of the semantic domain model are not orthogonal. This issue can be addressed by reassigning the weights of concept links based on the relationship graph of the ontology concepts
The spread activation algorithm has been used to deduce relationships based on a given set of relationships
SIR has been visualized as a four-layer process: keywords, indexed keywords, semantic concepts, relationships
References
Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, 2001, 284: 35-43
Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, 1st edition, Addison-Wesley, 1999
Pokorny J., Web Searching and Information Retrieval, Web Engineering, July/August 2004, 43-48