Contextual Insight in SearchEnabling Technologies and ApplicationsAleksander Øhrn, PhDAugust 31, 2005
Outline
• Background and scope
• Enabling technologies– Scalable search and data aggregation– XML search– Information extraction
• Example applications
Background
• Fast Search & Transfer ASA– Founded 1997, grew out of research at NTNU– Enterprise search market leader– Previously a major web search player with alltheweb.com
• Me– PhD in computer science from NTNU (2000)– Chief scientist at Fast Search & Transfer ASA– Main focus on the “intelligence layer” of search
Scope
Scalable search
Information extractionXML technologies
“Sentences where someone says something positive about Adidas.”
“Sentences where the acronym ‘MIT’ is defined.”
“Dates and locations related to the query ‘d-day’.”
“Paragraphs that discuss a company merger or acquisition.”
“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”
“Quotations where somebody says something about the Gaza Strip.”
Search engines and databases
• Serve similar purposes– Organize your data so that it can be queried
• Hybrid deployments– Database offloading
Search engines Relational databases
Data types Originally developed with unstructured textual data in mind, are becoming increasingly good also at dealing with structured data
Originally developed with structured data in mind, are becoming increasingly good also at dealing with unstructured textual data
Loss tolerance Typically optimized for finding and returning only “enough”, “good” results
Typically optimized for finding all results
Performance focus Getting results back fast Depends
Ranking Natural Sorting by some field
Advanced linguistics Lots Little
Transactions Not really Yes
Joins Not really Yes
Anatomy of a search platform
CO
NT
EN
T A
PI
QU
ER
Y A
PI
MANAGEMENT & APPLICATION SERVICES
SECURITY ACCESSACL Monitor User Monitor
>
Portlets
Applications
SFE
Custom
Web
File
QUERY CONNECTORS
CONTENT CONNECTORS
Database
Application
Custom
Deployment Business Application Administration
TOOLS & TOOL BUILDING FRAMEWORK
Custom
• Processes the content before it gets indexed– Documents flow through a pipeline of processing stages– Highly customizable
• Example processing stages– Format, language and encoding detection– Format and encoding normalization– HTML parsing– Entity extraction– Vectorization– Categorization– Lemmatization and synonym handling– Anchor text harvesting– ...
Document processing
venus williams; arthur ashe; ...
{(venus williams, 1.0), (wimbledon, 0.81), (center court, 0.65), ...}
sports/tennis
mouse ~ mice; car ~ automobile
pdf → html; iso-8859-1 → utf-8
pdf; english; iso-8859-1
Document Processing
“UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.”
Query processing
• Processes the query before it gets sent to the search engine– Queries flow through a pipeline of processing stages– Highly customizable
• Example processing stages– Stopword handling– Phrasing and antiphrasing– Spellchecking– Natural language handling– Lemmatization and synonym handling– Ontologies– ...
mouse ~ mice; car ~ automobile
xsara < citröen < car < vehicle
brittany speers → britney spears
television under 200 dollars → television AND price:<200
red cross → “red cross”; where can i find information about cars → cars
Engine architecture
• Scaling in data volume– Add columns– Each column holds a
partition of the data– Query the partitions in
parallel
• Scaling in query traffic– Add rows– Replicate the partitions– Distribute the queries
• Scaling in other dimensions– Query complexity– Fault tolerance
Engine architecture
qrserver nodes
dispatch nodes
search nodes(single tier)
indexing nodes
• Query and result processing– Turn bad queries into good ones– Help navigate among matching
documents
• Divide and conquer– Parallelize the searching over disjoint
sets of documents– Merge the results
Engine architecture
• For very big deployment scenarios
– Web scale, i.e., billions of documents
• All search nodes are equal, but some are more equal than others
– Organize the search nodes into multiple tiers– Top tier nodes may have fewer documents and run on better hardware– Keep the good stuff in the top tiers– Only fall through to the lower tiers if not enough good matches are not found in the top tiers– Use query logs to decide which documents that belong in which tiers
Tier 1
Tier 2
Tier 3
Fallthrough?
Fallthrough?
Data aggregation
• Computes histograms and scalar statistics across the result set– Counting is done distributed across the search nodes– For numerical or temporal fields, this may involve applying a real-time
discretization algorithm– The histograms can be scored according to their ”interestingness”
• Histograms may serve as query refinement sources– The counts preview how many results to expect– Adds a discovery aspect
Data aggregation
• Trends and correlations– Query specific
• Ad hoc reports and OLAP– Analytical applications
powered by search
Research Topics
stress echocardiographyimage orientationregurgitant orificeabnormal relaxation
two-dimensional echocardiographyventricular response in patients
initial repairmitral lesions
echocardiographic contrastmyocardial infarction
Co-Authors
A Tajik 56J Oh 25P Pellikka 16B Khanderia 1
6D Hagler 13V Roger 13K Bailey 13F Miller 11
Mayo
Cli
nic
Pro
c
J A
m S
oc E
ch
ocard
iog
rap
hy
J A
m C
oll
Card
iolo
gy
Cir
cu
lati
on
Am
J C
ard
iolo
gy
Ech
ocard
iog
rap
hy
Am
Heart
J
Dobutamine 1 2 5 4 1Cardiotonic agents 1 3 2 1Contrast media 1 1Atropine 1Imagent US 1
Chemical substances
Jou
rnals
Documents authoredby J Seward
21
Result processing
• Unsupervised clustering– Analyzes on-the-fly similarities across matching documents– Group together documents having similar content– Provides a bird’s-eye view of topical spread
• Query refinement suggestions– Examines the distribution of meta data– Builds a histogram of values– Provides a means for slicing and dicing the result set
• Filtering– E.g., dynamic duplicate detection
Relevancy
RelevancyprimitivesRelevancyprimitives
CompletenessAuthorityStatisticsQualityFreshness
CASQF
C SA FQ
Rankscore
Completeness
How well does the query match superior contexts?
Authority Is the document considered an authority for this query?
Statistics How well does the content of this document match the query?
Quality Is this a document of ”high quality”?
Freshness How old is this document, when was it last updated?
Geography Where are you querying from?
XML search
• Moving beyond a flat document model– From a simple (field, value) layout to complex,
nested structure having scopes/tags– From a predefined index layout to schema
flexibility
• Some queries cannot be adequately handledwithout structure– Flattening out the content won’t quite work– False positives slip through
<authors> <author>John P. Brown</author> <author>George Smith</author></authors>“Show me documents authored by John Smith”
XML search
• Queries that combine structure and content– Impose contextual constraints on the content
• FAST Query Language (FQL)– Partial overlap with XQuery– Linguistic extensions
• Return matching scopes– See the matching document
fragments, including markupand annotations
<matches> <match> <sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence> </match> <match> <sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence> </match> <match> <sentence>”He’ll fight the cancer,” says <person>John Barnes</person>, founder of his Welsh fanclub. </match></matches>
xml:sentence:(“cancer” and scope(person))
//sentence[fast:matches(., “cancer”) and .//person]
Data aggregation and XML
• Constrain the aggregation to the level of matching scopes– Document-level aggregation may yield too imprecise results– Provides a more factual and relevant relation between the query and
the histogram entries– Example from Wikipedia
• Or to some other constrained scope– E.g., match documents that within the same sentence contains a
company the word ’scandal’, but aggregate over all person names that occur in sentences that contain the job title ’cfo’.
Persons that appear in documents that contain the word ‘soccer’
Persons that appear in paragraphs that contain the word ‘soccer’
Data aggregation and XML
• Aggregate not only over content, but also over structure!
• Discovery, schema exploration and disambiguation
xml:sentence:(“clinton” and scope(date))
<sentence>So, whether you are visiting for the <date base=“XXXX-07-04”>4th of July</date> weekend or an extended stay, we are confident that you will enjoy your time in <location type=“city”>Clinton</location>!</sentence>
<sentence><person>Hillary Rodham Clinton</person> was elected <location type=“country”>United States</location> Senator from <location type=“city”>New York</location> on <date base=“2000-11-07”>November 7, 2000</date>.</sentence><sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, a <university>Stanford University</university> classmate since <date base=“2000-09-08”>Sept. 8, 2000</date>.</sentence>
person (3)date (3)location (2)university (1)
person (2)location (1)
“Which scopes exist in my matching document fragments? “
“Inside which scopes are my query terms found?”
Information extraction
• Apply text mining techniques to identify entities of interest– Structural and semantic regions– Makes unstructured data more structured
• Mark them up in context– Grammars as scope producers– Scopes can be annotated with meta data
• Make it possible to act on the information!– E.g., make it searchable in a way that preserves context
...in fair and free elections.</sentence> <sentence><person base=“Leonid Kuchma” title=“president” >Kuchma</person> was reelected in <date base=“1999-11-XX”>November 1999</date> to another five-year term, with 56% of the vote.</sentence> <sentence>International observers criticized...
...in fair and free elections. Kuchma was reelected in November 1999 to another five-year term, with 56% of the vote. International observers criticized...
Searching with context
• High search precision and expressiveness– Semantically logical boundaries– Disambiguation– Wildcards at the semantic level– Work with normalized forms– Where there’s smoke, there’s fire– ...
• Navigation– Suggestions to broaden/refine – Discover, summarize and disambiguate
E.g., ‘bush’ as in ‘George Bush’ and not ‘burning bush’
Predicates that test for scope existence
E.g., use sentence or paragraph as the main enclosing scope
2001-09-11, Sept. 11, 11th of September, ...
E.g., use existence of a zip code to find an address
sentence → paragraph; paragraph → sentence
’clinton’ as in ’person’ or in ’location’?
Complex relations
• Relations are often of interest– E.g., (gene x disease) or (person x company x jobtitle) or (company x product) – Possible to model explicitly, if we are willing to invest the effort
• Can we avoid upfront modelling?– We want to enable ad hoc fact finding and relations– We cannot foresee everything– Even if we did, we may not want to spend the time modelling all the targeted combinations– We want to maximize document processing throughput
• Use a scoped search to relate entities– Require the entity types involved to co-occur within the same semantically meaningful scope– Filter and count over the matching scopes
• Statistics to the rescue?– Co-occurrence is no guarantee for relatedness– Implies a certain amount of redundancy to work well
<works-for person=“John Smith” company=“Microsoft” title=“Vice President”> <sentence><person>Johnny</person> works at <company>Microsoft</company>.</sentence> <sentence>He is a <jobtitle>VP</jobtitle> there.</sentence></works-for>
POS tagging
• Annotate each token with part-of-speech data– Noun, verb, adjective, ...
• Markov based taggers– Treat sequences of tags as a Markov model– Use the Viterbi algorithm to find the most probable path– Extensions to standard Markov models, e.g., using trigrams
• Transformation based taggers– Assign the most probable tags as a starting point– Use rewrite rules to modify the tags– Learning algorithm to find the best rules and their application order
POS tagging
• Annotate the text with zero or more layers of meta data– The original surface form of the text can be viewed as trivial meta data
• Apply a pattern matcher over selected layers– Handcrafted rules or trained models– Extract the surface forms that correspond to the matching patterns
• Regular expressions over the surface forms go a long way– Modern regular expression engines go beyond regular languages– Callouts are our friends! Use them as “oracles”
Layered pattern matching
... ... ... ... ... ...
Layer 2 Man Food
Layer 1 N/proper V/past/eat
DET ADJ N/singular
Layer 0 Richard ate some bad curry
Sentiment analysis
“What is the tone of this document?”
“Is there a positive spin on current news related to my brand?”
“Thumbs up or thumbs down?”
Sentiment analysis
• Based on work by Turney and Littman– Depends on initial word seeds defined as “polar opposites”– Search for interesting words and phrases in a training corpus that
occur together with the seeds, and score them– Use the scored words and phrases as the classification vocabulary
• Interesting generalizations possible– Multiple dimensions– Dimensions other than thumbs up/down
Threat Assessment
Bin Laden Message
al-Khobar
Communique #1
Hamas Decree
Communique #2
Iran Statement
Nuclear 9/11 Assessment
Port Security
Securing Homeland Plan
Terrorist Nuclear material
Terrorist Obj's
Haifa Street
asteroid
evolution
gun
Terrorist Cells
-200
-100
0
100
200
300
400
500
-350 -250 -150 -50 50 150 250 350
Timing
Sc
ale
Lo
cal
Lo
cal
Glo
bal
Glo
bal
DistantDistant ImminentImminent
Sentiment analysis and XML
• Bring the analysis down to a subdocument level– Provides a more accurate sentiment assessment– Enables precise searching– Enables retrieval of the “evidence”
• Scalar statistics aggregated over matching scopes– E.g, average sentiment degree
<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver</sentiment> a profit in Q1.</sentence>
<sentence>The new shoe from <company>Adidas</company> is <sentiment degree="0.9">really awesome</sentiment>!</sentence>
Task-oriented interfaces
q, e → xml:sentence:(q and scope(e))
Question answering
• Some classes of questions transform naturally to scope queries
• Statistics to the rescue!
when (is|was|did|...) X? → xml:sentence:(X’ and (scope(date) or scope(time)))
what does X stand for? →xml:acronym:(@base:X and scope(@definition))
who works (at|for) X? →xml:sentence:(company:X and scope(person) and
scope(jobtitle))
why (is|are|do|does|have|has|...) X? →xml:sentence:(X’ and (“because” or “due to” or ...))
when did the berlin wall fall? when is christmas? when was d-day?
Audio and video
• Speech-to-text software as an XML producer– Speaker identification– Scene changes– Timing information
• Contextual access to multimedia content
Contextual navigation Contextual
access
Scope
Scalable search
Information extractionXML technologies
“Sentences where someone says something positive about Adidas.”
“Sentences where the acronym ‘MIT’ is defined.”
“Dates and locations related to the query ‘d-day’.”
xml:quotation:(“gaza strip”)
xml:sentence:(“adidas” and sentiment:@degree:>0)
“Paragraphs that discuss a company merger or acquisition.”
“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”
“Quotations where somebody says something about the Gaza Strip.”
xml:sentence:acronym:(@base:“mit” and scope(@definition))
xml:sentence:(“d-day” and (scope(date) or scope(location))
xml:paragraph:(string(“merger”, linguistics=“on”)
and scope(company) and scope(price))
xml:paragraph:quotation:(@speaker:”bush” and scope(price)))
Concluding thoughts
• Opportunities for semantically motivated relevancy models– Like business rules, almost
• What kind of applications would emerge...– ...if the entire web was enriched in this way?