Database-Inspired Search
David Konopnicki and Oded Shmueli
IBM Haifa, Technion
• Started as “Jerry and David's Guide to the World Wide Web”
• Funded in April 1995 with an initial investment of $2 million
W3QL – W3QS: A database approach to Web data
• A way to “improve” search results
• A database language for searching the web
• Using full-text indexes as starting points
• Had conditions on “semi-structured” formats:
  n1.format eq “Latex File” && n1.section[3].content =~ /zoo/
• Would record form fillings and re-execute them automatically
• Basically, a way to define personal crawlers
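
The format condition on this slide can be read as a predicate over a parsed document node. The sketch below is not W3QL; it is a minimal Python rendering under stated assumptions: `Node` and `Section` are hypothetical record types, and `section[3]` is treated as plain list index 3.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    content: str


@dataclass
class Node:
    format: str
    section: list = field(default_factory=list)


def node_matches(n1: Node) -> bool:
    """n1.format eq "Latex File" && n1.section[3].content =~ /zoo/"""
    if n1.format != "Latex File":
        return False
    if len(n1.section) <= 3:  # no section[3], so the condition cannot hold
        return False
    return re.search(r"zoo", n1.section[3].content) is not None
```

A personal crawler in this spirit would fetch candidate pages from a full-text index and keep only the nodes for which the predicate holds.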
Contemporary Systems
• First generation languages: WebSQL (Mihaila, Mendelzon and Milo)
• Second generation languages: Weblog (Lakshmanan, Sadri, and Subramanian), Florid (Ludascher, Himmeroder, Lausen, May and Schlepphorst)
• Web restructuring languages: WebOQL (Arocena and Mendelzon), StruQL (Fernandez, Florescu, Kang, Levy and Suciu), Araneus (Mecca, Atzeni, Masci, Merialdo and Sindoni)
• Lorel (Abiteboul, Quass, McHugh, Widom and Wiener)
Present Trends
• Certainly, search engines nowadays are bigger, faster and more accurate
• A few new features:
  [Screenshots: Clusty, Teoma]
• Is searching the web easier?
Limitations Remain the Same
• Visually parsing results
• Search in context
• Searching beyond the first page of results
• Integrated search from my desktop, my enterprise and on to the world
Visually Parsing Results
• What is best?
• Lots of times we search for real-world objects, not documents
Merging Documents and Object Retrieval
[Figure: a Document result next to a Person object]
• Need to understand objects, attributes, etc.
Search Only the First Page of Results
• From a recent study on 12,500 queries:
  • 73.9% of Ask Jeeves first page results were unique to Ask Jeeves
  • 71.2% of Yahoo first page results were unique to Yahoo
  • 70.8% of MSN search first page results were unique to MSN search
  • 66.4% of Google first page results were unique to Google
• Need an automated way to search beyond the first page on several search engines simultaneously
• Full-text indexes are just starting points
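
The two ideas on this slide, measuring how much of an engine's first page is unique and merging several engines' results automatically, can be sketched as follows. This is an illustrative sketch, not the method of the cited study: the engine names and URLs in any input are placeholders, and round-robin interleaving is just one simple merging policy.

```python
from itertools import zip_longest


def unique_fraction(results: dict, engine: str) -> float:
    """Fraction of one engine's first-page URLs that no other engine returned."""
    own = results[engine]
    others = {u for name, urls in results.items() if name != engine for u in urls}
    return sum(1 for u in own if u not in others) / len(own)


def merge_round_robin(results: dict) -> list:
    """Interleave the ranked lists, keeping the first occurrence of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*results.values()):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

`unique_fraction` assumes the engine returned at least one result; a caller would guard against empty result lists.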
Desktop Search
• Quite different from web search
• No links - cannot use link analysis
• Information discovery versus locating information
Enterprise Search
• Quite different too:
  • Data integration from lots of systems
  • Critical intranet service
• IBM Intranet Search
  • 10,000 websites
  • 6 million indexed documents
• A new product called OmniFind
Search Architectures in the Enterprise
[Figure: three-tier architecture diagram]
• Applications: Intranet Search, Employee Portals, Employee Directories, Corporate Info & Commerce Search, Customer Services, Sales Force InfoCenter
• Search Services: Enterprise Search, Collections
• Content Sources: E-mail Systems, Web Servers, News Servers, Content Servers, File Servers, Portal Servers, CRM Systems, Directory Servers
• Information integration without a schema!
• Really?! What about schema mappings, joins…
An Example: DB2 Crawling in OmniFind
• For every table, select fields: for each field, define whether it should be full-text searchable, whether it should support range conditions, etc.
• Full Boolean operations are supported
• The next frontier: fast index building!
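
The per-field choices described above can be pictured as a small declarative configuration. The sketch below does not reproduce OmniFind's configuration format or API; `FieldSpec`, the `EMPLOYEE` table and its column names are all hypothetical, and the matcher only shows how a keyword condition and range conditions could be combined with AND.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    full_text: bool = False    # tokenize and index for keyword search
    range_query: bool = False  # keep typed values for range conditions


# Hypothetical table "EMPLOYEE"; the columns are made up for illustration.
EMPLOYEE_FIELDS = [
    FieldSpec("resume", full_text=True),
    FieldSpec("salary", range_query=True),
    FieldSpec("name", full_text=True),
]


def row_matches(row, specs, keyword=None, ranges=None):
    """AND of one keyword condition and any number of range conditions."""
    by_name = {s.name: s for s in specs}
    if keyword is not None:
        text_fields = [s.name for s in specs if s.full_text]
        if not any(keyword.lower() in str(row[f]).lower().split() for f in text_fields):
            return False
    for name, (lo, hi) in (ranges or {}).items():
        if not by_name[name].range_query:
            raise ValueError(f"{name} was not configured for range conditions")
        if not lo <= row[name] <= hi:
            return False
    return True
```

Declaring the capabilities per field lets the crawler build only the index structures each column needs.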
UIMA: Unstructured Information Management Architecture
• An open architecture
• A software framework for processing unstructured information
• Plug-n-Play with back-end Search Technologies
• Freely available on IBM alphaWorks
UIMA’s Basic Building Blocks are Annotators
[Figure: the sentence “Fred Center is the CEO of Center Micros” annotated in layers, stored in the CAS]
• Parser: NP, VP, PP
• Named Entity: Person, Organization
• Relationship: CeoOf (Arg1: Person, Arg2: Org)
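
The layering in the figure, where each annotator reads the shared analysis structure and adds its own annotations, can be sketched in a few lines. This is a toy model and not the UIMA API: `Cas`, `Annotation` and both annotators are hypothetical, the entity lookup is a hard-coded gazetteer, and the sentence is the figure's example as read from its labels.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, a):
        return self.text[a.begin:a.end]

    def select(self, type_):
        return [a for a in self.annotations if a.type == type_]


def entity_annotator(cas):
    # Toy gazetteer; a real annotator would use a trained recognizer.
    for type_, name in (("Person", "Fred Center"), ("Organization", "Center Micros")):
        for m in re.finditer(re.escape(name), cas.text):
            cas.annotations.append(Annotation(type_, m.start(), m.end()))


def ceo_of_annotator(cas):
    # Builds on the entity annotations added by the previous annotator.
    for p in cas.select("Person"):
        for o in cas.select("Organization"):
            between = cas.text[p.end:o.begin]
            if re.fullmatch(r"\s+is the CEO of\s+", between):
                cas.annotations.append(
                    Annotation("CeoOf", p.begin, o.end, {"arg1": p, "arg2": o}))


def run_pipeline(text, annotators):
    cas = Cas(text)
    for annotate in annotators:
        annotate(cas)
    return cas
```

The order of annotators matters: the relationship annotator finds nothing unless the named-entity annotator has already run.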
UIMA Component Architecture from “Source to Sink”
[Figure: Collection Processing Engine (CPE) pipeline]
• Sources: Text, Chat, Email, Audio, Video
• Collection Reader
• Aggregate Analysis Engine: Analysis Engines, each wrapping an Annotator, with a CAS passed between stages
• CAS Consumers
• Sinks: Ontologies, Indices, DBs, Knowledge Bases
Future Search Integration Service
• Requirements
  • Index Integration
  • Object Aware (“schema”)
  • Correlation Aware (“flexible” joins)
  • Context Aware (“language”)
[Figure: a Search Integration Service over a Desktop Index, Enterprise Indexes 1-3 and Web Indexes 1-4]
Search Integration Services Capabilities
• Need APIs for querying and control
• Control capabilities
  • Specifying the number of results, result chunks
  • Total size of results
  • Degree of validity, recency, trust, security level…
  • Time constraints, cost constraints, privacy constraints, security constraints
  • May specify tradeoffs
• Semantic capabilities: APIs
  • Relevant ontologies
  • Description of resources
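
A control API of the kind listed above might take a small options record alongside the query. The sketch below is one possible shape under stated assumptions, not an existing interface: `QueryControl`, `Hit`, the trust score in [0, 1] and the per-hit cost in cents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryControl:
    max_results: int = 10
    chunk_size: int = 5
    min_trust: float = 0.0                # hypothetical trust score, 0..1
    max_cost_cents: Optional[int] = None  # total budget across sources


@dataclass
class Hit:
    url: str
    trust: float
    cost_cents: int


def apply_control(hits, ctl):
    """Filter by trust, stop at the result or cost limit, return result chunks."""
    kept, spent = [], 0
    for h in hits:
        if h.trust < ctl.min_trust:
            continue
        if ctl.max_cost_cents is not None and spent + h.cost_cents > ctl.max_cost_cents:
            break
        kept.append(h)
        spent += h.cost_cents
        if len(kept) == ctl.max_results:
            break
    return [kept[i:i + ctl.chunk_size] for i in range(0, len(kept), ctl.chunk_size)]
```

Tradeoffs between constraints (for example, accepting lower trust when the budget is nearly spent) would need a richer policy than this fixed order of checks.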
A Changing Landscape
• Search Integration Services
• Semantic web capabilities
• Technologies for Supporting Comprehensive Search:
  • XML search
  • NL
  • annotation servers
  • collaborative bookmarks
  • domain-specific services
What kind of Applications are we considering?
• Generally involves a comprehensive answer to a question
• Not the kind you can perform by viewing a single result page – although these are very important
• Very time consuming with current tools
• May involve public and proprietary information
• May involve information from various sources
• May involve personal information
• May involve payment for certain resources
• May be time constrained
• May be of adjustable levels of dependability, clarity, recency
Kinds of Questions
• Informational: U.S. educational spending in cities with population of at least one million
• Recommendation: What treatment is recommended for X
• Technical: detailed techniques for water purification
• Workflow: How do I organize a trip to Y: visa, flights, vaccinations, money exchange, cellular service, consulate, emergencies
• Compositional: How do I perform a task electronically by composing various services
• These are difficult to answer with current tools
Towards a Comprehensive Platform
• A language and a system supporting it
• Why an additional language?
  • To take advantage of a collection of sophisticated services – search engines, semantics, collaborative tools, advanced techniques …
  • To provide a context to search services
  • To enable better result presentation services
  • To enable personalization of the task at hand
  • When required, look at ‘raw’ data rather than only derived products
  • To enable optimization
Search Integration System
[Figure: components of the Search Integration System]
• Search & Control
• Full Text Search
• XML & DB Search
• Semantic Sources
• Desktop, Enterprise, Web Search
• P2P, RSS, BLOG, Wikis search
• Files, Databases
• Semantic KB, Semantic search engines
• Neighborhood Querying, Ranking, Preferences…
• Annotations, NLA of documents
• Natural Language Analysis of Queries
Semantic Web: Search and Integration
• Look at mixed resources – involving traditional as well as semantic layers (annotation).
  • Search the semantic web (as in Swoogle)
  • Use ontologies to resolve ambiguities
  • Include reasoning capabilities
  • Use various measures for semantic proximity
  • Combine information from multiple sources and resolve conflicts (trust, easier for intranets)
  • Use ontologies to organize results in human readable form
  • Supply explanations – how is information deduced
Semantic Web: Search and Integration
• Search semantic data (KB) to obtain access to described traditional resources (as in TAP)
  • Resolve ambiguities at the data level
  • Deduce keywords for traditional search engines to obtain additional information
  • Examine likely sources (e.g., IMDB)
  • Continue further exploration of described resources
Swoogle (extracted from the site)
• Swoogle is a crawler-based indexing and retrieval system for the Semantic Web -- RDF and OWL documents encoded in XML or N3
• Swoogle extracts metadata for each discovered document, and computes relations among them
• Swoogle is intended as a resource to support services needed by software agents and programs via web service interfaces and also for semantic web researchers to use directly via the web interface
• It is not designed to support casual users seeking to answer queries on the web (e.g., "what is the population of the capital of India?")
TAP (extracted from the site)
• The TAP KB is a shallow but broad knowledge base containing basic lexical and taxonomic information about a wide range of popular objects
• Our goal is to bootstrap the Semantic Web by providing a comprehensive source of basic information about popular objects
• The KB currently includes knowledge about:
  • Music: Popular music, musicians & groups, instruments, styles, composers
  • Movies: Top Movies, actors, television shows
  • Authors: Top book authors, classic books
  • Sports: Athletes, sports, sports teams, equipment
  • …
The KB
<tap:UnitedStatesSenator rdf:ID="http://tap.stanford.edu/data/PoliticianDodd,_Christopher">
  <rdfs:label xml:lang="en">Christopher Dodd</rdfs:label>
  <tap:representsPlace rdf:resource="http://tap.stanford.edu/data/ConnecticutState"/>
  <tap:memberOf rdf:resource="http://tap.stanford.edu/data/USDemocraticParty"/>
</tap:UnitedStatesSenator>
</rdfs:Class>
<rdfs:Class rdf:ID="http://tap.stanford.edu/data/UnitedStatesSenator">
  <rdfs:label xml:lang="en">Sen.</rdfs:label>
  <rdfs:label xml:lang="en">Senator</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://tap.stanford.edu/data/Politician"/>
  <tap:plural>senator</tap:plural>
</rdfs:Class>
Semantic Web: Task Formation
• Use ontologies to deduce a workflow for performing a task
• Applicable to composing web services
• The task itself may involve a number of sites
• Parts may be executable:
  • on the web
  • via other means
  • via web services
• The output may be a complete or partial task fulfillment
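
One simple reading of "deducing a workflow" is: collect the steps a goal requires according to the ontology, then order them so that every prerequisite comes first. The sketch below does only that, using the standard-library `graphlib` module (Python 3.9+). The `REQUIRES` facts are hypothetical, and real ontology reasoning would involve far more than a dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical "requires" facts, as an ontology might state them.
REQUIRES = {
    "book flights": {"fix dates"},
    "apply for visa": {"book flights"},
    "get vaccinations": {"fix dates"},
    "exchange money": {"apply for visa"},
}


def deduce_workflow(goal_steps, requires):
    """Return the goal steps plus all prerequisites, in an executable order."""
    needed, stack = set(), list(goal_steps)
    while stack:  # transitive closure of prerequisites
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(requires.get(step, ()))
    graph = {s: set(requires.get(s, ())) for s in needed}  # node -> predecessors
    return list(TopologicalSorter(graph).static_order())
```

Steps that cannot be executed on the web or via web services would simply be reported to the user, giving a partial task fulfillment.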
Business Trip Planner Agent Example-1
• Present coherent information for trip planning
  • Dates, constraints, preferences, organizational policy
  • Company resources and clients in the area
    • History of contacts, clients, deals, prospects
  • Destination conditions based on historical data
    • weather, tourist information, official holidays
  • Latest news at destination and vicinity
    • commercial, political, religious, security, crime, medical
Business Trip Planner Agent Example-2
• Additional information for trip planning
  • Airline, hotel, car rental data
  • Suggest itinerary based on constraints
  • Prepare to make reservations on-line
  • Personal friends, family in the area
  • Must-visit tourist attractions
    • dates, rates, photos, video, historical background, links
  • Major seasonal attractions
    • festivals, concerts, theatre
• Once information is machine “understandable” one should be able to construct a trip planner agent
Technologies for Supporting Comprehensive Search
1. Querying Modes and Control
  • The exact structure may not always be known and relationships need to be specified in a flexible way; various semantics are possible
  • Declaratively stating priorities
2. Ranking
  • Ranking is a critical component, both in weighting different scores and in controlling the ordering of result presentation
3. Neighborhood Querying
  • Imprecise querying mode in which similar or near entities/objects are retrieved
1. Querying Modes and Control
• NL understanding
  • Web pages contain phrases whose similarity is not just based on syntactical matching; the meaning may depend on context, language usage and more
• Flexible Querying
  • The exact structure may not always be known and relationships need to be specified in a flexible way; various semantics are possible
• Query control: Preferences
  • A search may involve resources and tradeoffs may need to be specified; preferences may also address quality, recency, amount, language and other factors
Querying Modes and Controls Example
• Trying to locate information about a movie based on fairly vague recollections
  • It is based on a book
  • It deals with military political issues, maybe a coup or a coup attempt, or a kidnapping
  • From the fifties or sixties
  • The lead role is a famous movie star of that time
  • It’s not the one with Peter Sellers and it’s not Failsafe and not the one with submarines
  • The plot involves Generals, Colonels and the President, maybe not all of them, and there might also be a Senator or two
Querying Modes and Controls Example
• Solving the above may utilize
  • a movie database with an associated ontology
  • a flexible querying language that attempts maximal subset satisfaction
  • a web search engine with some NL understanding (of the plot)
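
Maximal subset satisfaction can be illustrated by scoring each candidate by how many of the stated conditions it meets and keeping the candidates with the highest count, rather than demanding that all conditions hold. The sketch below is one simple interpretation. The movie records and field names are hypothetical placeholders, and counting conditions equally is an assumption; a real system could weight them.

```python
def max_subset_matches(records, conditions):
    """Return (best_count, records satisfying the largest number of conditions)."""
    scored = [(sum(1 for c in conditions.values() if c(r)), r) for r in records]
    best = max((s for s, _ in scored), default=0)
    return best, [r for s, r in scored if s == best]


CONDITIONS = {
    "based on a book": lambda m: m["based_on_book"],
    "fifties or sixties": lambda m: 1950 <= m["year"] <= 1969,
    "coup or kidnapping": lambda m: bool(m["plot"] & {"coup", "kidnapping"}),
    "not Peter Sellers": lambda m: "Peter Sellers" not in m["cast"],
}

# Placeholder records; titles and cast names are invented for illustration.
MOVIES = [
    {"title": "Movie A", "year": 1964, "based_on_book": False,
     "plot": {"coup"}, "cast": {"Star X"}},
    {"title": "Movie B", "year": 1964, "based_on_book": True,
     "plot": {"nuclear war"}, "cast": {"Peter Sellers"}},
    {"title": "Movie C", "year": 1977, "based_on_book": False,
     "plot": {"kidnapping"}, "cast": {"Star Y"}},
]
```

With these records no movie satisfies all four conditions, yet `max_subset_matches(MOVIES, CONDITIONS)` still returns the closest candidate instead of an empty answer.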
Querying Modes and Controls Example, Cont.
• While I’m really interested, please
  • Work on it for no more than an hour
  • Don’t spend more than a dollar finding the answer
  • Use only highly trusted sources
  • Obtain photos and video clips if possible, especially those involving the lead star, Washington sites, trucks and airplanes
• The most important items are how much the movie grossed and whether the lead star was nominated for an Oscar for this movie
2. Ranking
• Composition
  • Various “judges” may score differently; allow scoring of search terms, services, relevancy
• Top-k Queries
  • Multidimensional objects; monotone aggregation function on attributes; on each attribute, a list in rank order; find k top ranked objects
  • Many variations; e.g., applications for finding “best” pages based on ranking by various services
• Ranked Query Results
  • Ranking query results in desired order also applies to the semantic web, important for retaining user attention as well as in specifying sub queries during compilation/execution
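
The top-k setting described on this slide (one ranked list per attribute, a monotone aggregation function, find the k best objects) is the setting of threshold-style algorithms. The sketch below follows the well-known threshold algorithm idea; the slide does not name a specific algorithm, so this choice is an assumption. It also assumes every object has a score in every list and that `agg` is monotone (for example `sum` or `min`).

```python
import heapq


def threshold_topk(lists, k, agg=sum):
    """lists: one dict per attribute, object -> score. Returns top-k [(score, obj)]."""
    ranked = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    seen, top = set(), []  # top is a min-heap holding the best k so far
    for depth in range(len(ranked[0])):
        for r in ranked:
            obj = r[depth][0]                       # sorted access
            if obj not in seen:
                seen.add(obj)
                score = agg(l[obj] for l in lists)  # random access to all lists
                heapq.heappush(top, (score, obj))
                if len(top) > k:
                    heapq.heappop(top)
        # Best score any still-unseen object could reach at this depth.
        threshold = agg(r[depth][1] for r in ranked)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)
```

The early exit is what makes the approach attractive when each list comes from a different service: the lists are read only as deep as needed to be sure of the k best objects.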
Ranking Example
• Continuing the previous example, textual information may be provided by various search engines – rank the information based on the weights awarded to these engines
• Various photos may score differently on the star, Washington sites, airplanes and trucks; find the best
• Rank results, for example those that answer the most conditions that are judged to be the most important
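
Ranking by engine weights can be sketched as a weighted rank fusion: each engine contributes to a result's score in proportion to the engine's weight and inversely to the result's position. The weighted reciprocal-rank formula used here is an illustrative choice, not something the slides prescribe, and the engine names and weights in any input are placeholders.

```python
from collections import defaultdict


def weighted_fusion(rankings, weights):
    """rankings: engine -> ordered list of URLs; weights: engine -> weight."""
    scores = defaultdict(float)
    for engine, urls in rankings.items():
        for position, url in enumerate(urls, start=1):
            scores[url] += weights[engine] / position  # weighted reciprocal rank
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))
```

A result returned by a highly weighted engine can outrank the top result of a lightly weighted one, which is the intended effect of awarding weights to engines.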
3. Neighborhood Querying - flexibility
• k Nearest Neighbors
  • Locate near-by objects in a multidimensional space; objects may be pages or traditional objects, where each dimension corresponds to a property (attribute)
• Complex Similarity Queries
  • Identify similar objects to a given object set
  • Detecting “identical objects”
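
A k-nearest-neighbors query over objects described by numeric attributes can be written as a brute-force scan. The sketch below assumes Euclidean distance and in-memory data, both of which are illustrative choices; large collections would use a spatial index instead. It needs Python 3.8+ for `math.dist`.

```python
import heapq
import math


def k_nearest(query, objects, k):
    """objects: name -> attribute vector; returns the k names closest to query."""
    return heapq.nsmallest(k, objects, key=lambda name: math.dist(query, objects[name]))
```

For example, with `objects = {"a": (0, 0), "b": (1, 1), "c": (5, 5), "d": (0.5, 0)}`, the call `k_nearest((0, 0), objects, 2)` returns the two closest objects, nearest first.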
Neighborhood Querying - Example
• Continuing the example, if a coup or kidnapping plot is not found, a close one may be a plot of some other type, for example an overthrow, and instead of the military it may involve the secret service
• Maybe it was some other vehicle rather than trucks or planes
• Perhaps the movie was an Oscar candidate in some other category or its director/star were Oscar winners for other movies
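
The fallbacks in this example amount to query relaxation through an ontology: when a term yields nothing, try its siblings under the same parent concept. The sketch below shows that single step. The `PARENT` mini-ontology and the index contents are hypothetical, and treating all siblings as equally close is a simplification.

```python
PARENT = {  # hypothetical mini-ontology: child concept -> parent concept
    "coup": "plot", "kidnapping": "plot", "overthrow": "plot",
    "truck": "vehicle", "airplane": "vehicle", "jeep": "vehicle",
}


def relaxations(term, parent=PARENT):
    """The term itself first, then its siblings under the same parent concept."""
    siblings = sorted(t for t, p in parent.items()
                      if p == parent.get(term) and t != term)
    return [term] + siblings


def search_with_relaxation(term, index):
    """index: term -> list of hits. Return (term_used, hits) for the first match."""
    for candidate in relaxations(term):
        hits = index.get(candidate, [])
        if hits:
            return candidate, hits
    return None, []
```

Returning the term that actually matched lets the system explain to the user how the answer was reached.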
Moving on…
• The landscape is complex
  • Sophisticated tagging and information aggregation
  • Merging object and document retrieval
  • Focused search
  • New “sources” including RSS, Blogs, Wikis …
  • Useful result presentation
  • Cooperative bookmarks management
• We explored some ways to take advantage of this emerging landscape for sophisticated search and integration tasks