Contextual Insight in Search Enabling Technologies and Applications Aleksander Øhrn, PhD August 31, 2005


Outline

• Background and scope

• Enabling technologies
  – Scalable search and data aggregation
  – XML search
  – Information extraction

• Example applications


Background

• Fast Search & Transfer ASA
  – Founded 1997, grew out of research at NTNU
  – Enterprise search market leader
  – Previously a major web search player with alltheweb.com

• Me
  – PhD in computer science from NTNU (2000)
  – Chief scientist at Fast Search & Transfer ASA
  – Main focus on the “intelligence layer” of search


Scope

Scalable search

Information extraction

XML technologies

“Sentences where someone says something positive about Adidas.”

“Sentences where the acronym ‘MIT’ is defined.”

“Dates and locations related to the query ‘d-day’.”

“Paragraphs that discuss a company merger or acquisition.”

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”

“Quotations where somebody says something about the Gaza Strip.”


Search engines and databases

• Serve similar purposes
  – Organize your data so that it can be queried

• Hybrid deployments
  – Database offloading

| | Search engines | Relational databases |
| Data types | Originally developed with unstructured textual data in mind; increasingly good at handling structured data too | Originally developed with structured data in mind; increasingly good at handling unstructured textual data too |
| Loss tolerance | Typically optimized for finding and returning only “enough” “good” results | Typically optimized for finding all results |
| Performance focus | Getting results back fast | Depends |
| Ranking | Natural | Sorting by some field |
| Advanced linguistics | Lots | Little |
| Transactions | Not really | Yes |
| Joins | Not really | Yes |


Anatomy of a search platform

[Figure: block diagram of the search platform. Labels: CONTENT API; QUERY API; MANAGEMENT & APPLICATION SERVICES; SECURITY ACCESS (ACL Monitor, User Monitor); QUERY CONNECTORS (Portlets, Applications, SFE, Custom); CONTENT CONNECTORS (Web, File, Database, Application, Custom); TOOLS & TOOL BUILDING FRAMEWORK (Deployment, Business Application, Administration, Custom).]


Document processing

• Processes the content before it gets indexed
  – Documents flow through a pipeline of processing stages
  – Highly customizable

• Example processing stages
  – Format, language and encoding detection (pdf; english; iso-8859-1)
  – Format and encoding normalization (pdf → html; iso-8859-1 → utf-8)
  – HTML parsing
  – Entity extraction (venus williams; arthur ashe; ...)
  – Vectorization ({(venus williams, 1.0), (wimbledon, 0.81), (center court, 0.65), ...})
  – Categorization (sports/tennis)
  – Lemmatization and synonym handling (mouse ~ mice; car ~ automobile)
  – Anchor text harvesting
  – ...
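The pipeline idea on this slide can be sketched in a few lines of Python. This is only an illustration and not the platform's actual API: the stage functions, the dictionary-based document layout, and the toy synonym table are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]

# Toy synonym table (hypothetical); a real stage would load a dictionary resource.
SYNONYMS = {"mice": "mouse", "automobile": "car"}

def normalize_encoding(doc: Document) -> Document:
    # Decode the raw bytes with the detected encoding and record the new encoding.
    doc["text"] = doc["raw"].decode(doc.get("encoding", "utf-8"))
    doc["encoding"] = "utf-8"
    return doc

def tokenize(doc: Document) -> Document:
    # Lowercase and split on whitespace; deliberately naive.
    doc["tokens"] = doc["text"].lower().split()
    return doc

def apply_synonyms(doc: Document) -> Document:
    # Map each token to its canonical form where the table has an entry.
    doc["tokens"] = [SYNONYMS.get(token, token) for token in doc["tokens"]]
    return doc

def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Each stage receives the document produced by the previous stage.
    for stage in stages:
        doc = stage(doc)
    return doc

PIPELINE: List[Stage] = [normalize_encoding, tokenize, apply_synonyms]
doc = run_pipeline(
    {"raw": "Mice like caf\xe9s".encode("iso-8859-1"), "encoding": "iso-8859-1"},
    PIPELINE,
)
```

Because every stage has the same signature, customizing the pipeline amounts to editing the `PIPELINE` list.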


Document Processing

“UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.”


Query processing

• Processes the query before it gets sent to the search engine
  – Queries flow through a pipeline of processing stages
  – Highly customizable

• Example processing stages
  – Stopword handling
  – Phrasing and antiphrasing (red cross → “red cross”; where can i find information about cars → cars)
  – Spellchecking (brittany speers → britney spears)
  – Natural language handling (television under 200 dollars → television AND price:<200)
  – Lemmatization and synonym handling (mouse ~ mice; car ~ automobile)
  – Ontologies (xsara < citroën < car < vehicle)
  – ...


Engine architecture

• Scaling in data volume
  – Add columns
  – Each column holds a partition of the data
  – Query the partitions in parallel

• Scaling in query traffic
  – Add rows
  – Replicate the partitions
  – Distribute the queries

• Scaling in other dimensions
  – Query complexity
  – Fault tolerance
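The row and column layout can be pictured as a grid of search nodes. The sketch below is a toy model under stated assumptions: the `SearchGrid` class, modulo partitioning, and round-robin row selection are illustrative choices, not a description of how the actual engine assigns documents or balances load.

```python
from itertools import cycle
from typing import List, Tuple

class SearchGrid:
    """Toy model: columns partition the data, rows replicate the partitions."""

    def __init__(self, rows: int, columns: int) -> None:
        self.rows = rows
        self.columns = columns
        self._next_row = cycle(range(rows))

    def partition_for(self, doc_id: int) -> int:
        # Assumed partitioning scheme: document id modulo the number of columns.
        return doc_id % self.columns

    def route_query(self) -> List[Tuple[int, int]]:
        # Pick one row (round robin) and fan the query out to every column in it.
        row = next(self._next_row)
        return [(row, column) for column in range(self.columns)]

grid = SearchGrid(rows=2, columns=3)
first = grid.route_query()
```

Adding a column shrinks each partition, while adding a row gives the router one more replica to send queries to.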


Engine architecture

[Figure: node types in the engine: qrserver nodes, dispatch nodes, search nodes (single tier), indexing nodes]

• Query and result processing
  – Turn bad queries into good ones
  – Help navigate among matching documents

• Divide and conquer
  – Parallelize the searching over disjoint sets of documents
  – Merge the results
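The divide-and-conquer step can be sketched as follows. This is a simplified illustration: the term-count scoring and the function names are assumptions, and the partitions are searched sequentially here where a real dispatcher would query them in parallel.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)

def search_partition(partition: Dict[str, str], term: str, k: int) -> List[Hit]:
    # Toy scoring: number of occurrences of the term in the document text.
    hits = [(float(text.lower().split().count(term)), doc_id)
            for doc_id, text in partition.items()]
    return heapq.nlargest(k, [hit for hit in hits if hit[0] > 0])

def search(partitions: List[Dict[str, str]], term: str, k: int) -> List[Hit]:
    # Each partition returns its local top k; the dispatcher merges them
    # into a global top k.
    local_results = [search_partition(p, term, k) for p in partitions]
    return heapq.nlargest(k, (hit for hits in local_results for hit in hits))

partitions = [
    {"d1": "tennis at wimbledon", "d2": "tennis tennis news"},
    {"d3": "football news", "d4": "tennis tennis tennis"},
]
top = search(partitions, "tennis", k=2)
```

Since the partitions are disjoint, the global top k is always contained in the union of the local top k lists, which is why each node only needs to return k hits.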


Engine architecture

• For very big deployment scenarios
  – Web scale, i.e., billions of documents

• All search nodes are equal, but some are more equal than others
  – Organize the search nodes into multiple tiers
  – Top tier nodes may have fewer documents and run on better hardware
  – Keep the good stuff in the top tiers
  – Only fall through to the lower tiers if not enough good matches are found in the top tiers
  – Use query logs to decide which documents belong in which tiers

[Figure: Tier 1 → Fallthrough? → Tier 2 → Fallthrough? → Tier 3]
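The fallthrough logic can be sketched as a loop over tiers. This is a minimal sketch that assumes "enough good matches" can be reduced to a simple hit count; the slide does not say how match quality is judged, and the tier contents below are invented for the example.

```python
from typing import Callable, Dict, List, Sequence

TierSearch = Callable[[str], List[str]]

def tiered_search(tiers: Sequence[TierSearch], query: str, enough: int) -> List[str]:
    results: List[str] = []
    for search_tier in tiers:
        results.extend(search_tier(query))
        if len(results) >= enough:
            break  # enough matches so far, so do not fall through
    return results

# Hypothetical tier contents: the top tier holds the frequently requested documents.
tier1: Dict[str, List[str]] = {"tennis": ["t1-a", "t1-b"]}
tier2: Dict[str, List[str]] = {"tennis": ["t2-a"], "curling": ["t2-b"]}
tiers: List[TierSearch] = [
    lambda q: tier1.get(q, []),
    lambda q: tier2.get(q, []),
]
```

A popular query such as `tennis` is answered from the top tier alone, while a rare query such as `curling` falls through to the second tier.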


Data aggregation

• Computes histograms and scalar statistics across the result set
  – The counting is distributed across the search nodes
  – For numerical or temporal fields, this may involve applying a real-time discretization algorithm
  – The histograms can be scored according to their “interestingness”

• Histograms may serve as query refinement sources
  – The counts preview how many results to expect
  – Adds a discovery aspect
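Distributed counting can be sketched as per-node histograms that are merged by addition. The fixed-width binning below is only a stand-in for the real-time discretization algorithm mentioned on the slide, whose details are not given; the price values are invented.

```python
from collections import Counter
from typing import Iterable, List

def discretize(value: float, width: float) -> str:
    # Fixed-width bins (an assumption; the slide does not name the algorithm).
    low = (value // width) * width
    return f"{low:g}-{low + width:g}"

def local_histogram(values: Iterable[float], width: float) -> Counter:
    # Computed independently on each search node over its own matches.
    return Counter(discretize(value, width) for value in values)

def merge_histograms(partials: List[Counter]) -> Counter:
    # Counts from different nodes add up because the partitions are disjoint.
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total

node_a = local_histogram([120, 180, 450], width=100)
node_b = local_histogram([150, 420], width=100)
merged = merge_histograms([node_a, node_b])
```

Each bin of the merged histogram can then be offered as a refinement, with its count previewing the size of the refined result set.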


Data aggregation

• Trends and correlations
  – Query specific

• Ad hoc reports and OLAP
  – Analytical applications powered by search

[Figure: example report captioned “Documents authored by J Seward”, with the number 21.
  Research Topics: stress echocardiography; image orientation; regurgitant orifice; abnormal relaxation; two-dimensional echocardiography; ventricular response in patients; initial repair; mitral lesions; echocardiographic contrast; myocardial infarction
  Co-Authors: A Tajik 56; J Oh 25; P Pellikka 16; B Khanderia 16; D Hagler 13; V Roger 13; K Bailey 13; F Miller 11
  Journals: Mayo Clinic Proc; J Am Soc Echocardiography; J Am Coll Cardiology; Circulation; Am J Cardiology; Echocardiography; Am Heart J
  Chemical substances (counts per journal; column alignment not recoverable): Dobutamine 1 2 5 4 1; Cardiotonic agents 1 3 2 1; Contrast media 1 1; Atropine 1; Imagent US 1]


Result processing

• Unsupervised clustering
  – Analyzes similarities across matching documents on the fly
  – Groups together documents having similar content
  – Provides a bird’s-eye view of topical spread

• Query refinement suggestions
  – Examines the distribution of meta data
  – Builds a histogram of values
  – Provides a means for slicing and dicing the result set

• Filtering
  – E.g., dynamic duplicate detection


Relevancy

[Figure: the relevancy primitives Completeness (C), Authority (A), Statistics (S), Quality (Q) and Freshness (F) are combined into a rank score]

• Completeness: How well does the query match superior contexts?
• Authority: Is the document considered an authority for this query?
• Statistics: How well does the content of this document match the query?
• Quality: Is this a document of “high quality”?
• Freshness: How old is this document, when was it last updated?
• Geography: Where are you querying from?


XML search

• Moving beyond a flat document model
  – From a simple (field, value) layout to complex, nested structure having scopes/tags
  – From a predefined index layout to schema flexibility

• Some queries cannot be adequately handled without structure
  – Flattening out the content won’t quite work
  – False positives slip through

“Show me documents authored by John Smith”

<authors>
  <author>John P. Brown</author>
  <author>George Smith</author>
</authors>
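The false positive can be reproduced with a small sketch. The two matching functions are hypothetical helpers written for this example; they contrast a flattened bag-of-words view with matching inside each `author` scope.

```python
import xml.etree.ElementTree as ET
from typing import List

XML = ("<authors><author>John P. Brown</author>"
       "<author>George Smith</author></authors>")

def flat_match(xml: str, terms: List[str]) -> bool:
    # Flattened model: all text ends up in one bag of words.
    words = " ".join(ET.fromstring(xml).itertext()).lower().split()
    return all(term in words for term in terms)

def scoped_match(xml: str, scope: str, terms: List[str]) -> bool:
    # Structured model: every term must occur inside the same scope.
    for element in ET.fromstring(xml).iter(scope):
        words = "".join(element.itertext()).lower().split()
        if all(term in words for term in terms):
            return True
    return False

flat_hit = flat_match(XML, ["john", "smith"])                 # True: a false positive
scoped_hit = scoped_match(XML, "author", ["john", "smith"])   # False
```

The flattened view matches because "John" and "Smith" both occur somewhere in the field, even though no single author carries both names.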


XML search

• Queries that combine structure and content
  – Impose contextual constraints on the content

• FAST Query Language (FQL)
  – Partial overlap with XQuery
  – Linguistic extensions

• Return matching scopes
  – See the matching document fragments, including markup and annotations

xml:sentence:("cancer" and scope(person))

//sentence[fast:matches(., "cancer") and .//person]

<matches>
  <match>
    <sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence>
  </match>
  <match>
    <sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence>
  </match>
  <match>
    <sentence>“He’ll fight the cancer,” says <person>John Barnes</person>, founder of his Welsh fanclub.</sentence>
  </match>
</matches>
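What such a scope query asks for can be emulated over a small in-memory document. This sketch uses Python's standard `xml.etree.ElementTree` and is not an FQL implementation; the function name is hypothetical, and the last two sentences of the sample document are made up to show non-matches.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

DOC = """<doc>
<sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence>
<sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence>
<sentence>Cancer research received new funding.</sentence>
<sentence><person>John Barnes</person> founded the fan club.</sentence>
</doc>"""

def matching_scopes(xml: str, scope: str, term: str, required: str) -> List[str]:
    # Emulates the intent of xml:scope:("term" and scope(required)):
    # the term and an element named `required` must occur in the same scope.
    matches = []
    for element in ET.fromstring(xml).iter(scope):
        text = "".join(element.itertext())
        words = re.findall(r"\w+", text.lower())
        if term in words and element.find(f".//{required}") is not None:
            matches.append(text)
    return matches

hits = matching_scopes(DOC, "sentence", "cancer", "person")
```

Only the first two sentences qualify: the third mentions cancer without a person, and the fourth has a person without the term.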


Data aggregation and XML

• Constrain the aggregation to the level of matching scopes
  – Document-level aggregation may yield too imprecise results
  – Provides a more factual and relevant relation between the query and the histogram entries
  – Example from Wikipedia

• Or to some other constrained scope
  – E.g., match documents that contain a company and the word ‘scandal’ within the same sentence, but aggregate over all person names that occur in sentences that contain the job title ‘cfo’

Persons that appear in documents that contain the word ‘soccer’

Persons that appear in paragraphs that contain the word ‘soccer’


Data aggregation and XML

• Aggregate not only over content, but also over structure!

• Discovery, schema exploration and disambiguation

xml:sentence:("clinton" and scope(date))

<sentence>So, whether you are visiting for the <date base="XXXX-07-04">4th of July</date> weekend or an extended stay, we are confident that you will enjoy your time in <location type="city">Clinton</location>!</sentence>

<sentence><person>Hillary Rodham Clinton</person> was elected <location type="country">United States</location> Senator from <location type="city">New York</location> on <date base="2000-11-07">November 7, 2000</date>.</sentence>

<sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, a <university>Stanford University</university> classmate since <date base="2000-09-08">Sept. 8, 2000</date>.</sentence>

“Which scopes exist in my matching document fragments?”
  person (3), date (3), location (2), university (1)

“Inside which scopes are my query terms found?”
  person (2), location (1)
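Aggregating over structure can be sketched with the three sentences from this slide. The helper names are hypothetical, and the counting rule (every scope occurrence counts once) is an assumption, so the first histogram need not reproduce the slide's exact figures; the second one corresponds to the question of which scopes contain the query term.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

FRAGMENTS = [
    '<sentence>So, whether you are visiting for the <date base="XXXX-07-04">4th of July</date> '
    'weekend or an extended stay, we are confident that you will enjoy your time in '
    '<location type="city">Clinton</location>!</sentence>',
    '<sentence><person>Hillary Rodham Clinton</person> was elected '
    '<location type="country">United States</location> Senator from '
    '<location type="city">New York</location> on '
    '<date base="2000-11-07">November 7, 2000</date>.</sentence>',
    '<sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, '
    'a <university>Stanford University</university> classmate since '
    '<date base="2000-09-08">Sept. 8, 2000</date>.</sentence>',
]

def scopes_present(fragments: List[str]) -> Counter:
    # Histogram over the names of all scopes nested in the matching fragments.
    counts: Counter = Counter()
    for xml in fragments:
        root = ET.fromstring(xml)
        counts.update(el.tag for el in root.iter() if el is not root)
    return counts

def scopes_containing(fragments: List[str], term: str) -> Counter:
    # Histogram over the names of nested scopes whose text contains the term.
    counts: Counter = Counter()
    for xml in fragments:
        root = ET.fromstring(xml)
        for el in root.iter():
            words = re.findall(r"\w+", "".join(el.itertext()).lower())
            if el is not root and term in words:
                counts[el.tag] += 1
    return counts

present = scopes_present(FRAGMENTS)
containing = scopes_containing(FRAGMENTS, "clinton")
```

The second histogram is what makes disambiguation possible: it separates hits where `clinton` is part of a person name from the one where it is a city.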


Information extraction

• Apply text mining techniques to identify entities of interest
  – Structural and semantic regions
  – Makes unstructured data more structured

• Mark them up in context
  – Grammars as scope producers
  – Scopes can be annotated with meta data

• Make it possible to act on the information!
  – E.g., make it searchable in a way that preserves context

Plain text:
...in fair and free elections. Kuchma was reelected in November 1999 to another five-year term, with 56% of the vote. International observers criticized...

Marked up:
...in fair and free elections.</sentence> <sentence><person base="Leonid Kuchma" title="president">Kuchma</person> was reelected in <date base="1999-11-XX">November 1999</date> to another five-year term, with 56% of the vote.</sentence> <sentence>International observers criticized...


Searching with context

• High search precision and expressiveness
  – Semantically logical boundaries (e.g., use sentence or paragraph as the main enclosing scope)
  – Disambiguation (e.g., ‘bush’ as in ‘George Bush’ and not ‘burning bush’)
  – Wildcards at the semantic level (predicates that test for scope existence)
  – Work with normalized forms (2001-09-11, Sept. 11, 11th of September, ...)
  – Where there’s smoke, there’s fire (e.g., use existence of a zip code to find an address)
  – ...

• Navigation
  – Suggestions to broaden/refine (sentence → paragraph; paragraph → sentence)
  – Discover, summarize and disambiguate (‘clinton’ as in ‘person’ or in ‘location’?)


Complex relations

• Relations are often of interest
  – E.g., (gene x disease) or (person x company x jobtitle) or (company x product)
  – Possible to model explicitly, if we are willing to invest the effort

• Can we avoid upfront modelling?
  – We want to enable ad hoc fact finding and relations
  – We cannot foresee everything
  – Even if we did, we may not want to spend the time modelling all the targeted combinations
  – We want to maximize document processing throughput

• Use a scoped search to relate entities
  – Require the entity types involved to co-occur within the same semantically meaningful scope
  – Filter and count over the matching scopes

• Statistics to the rescue?
  – Co-occurrence is no guarantee for relatedness
  – Implies a certain amount of redundancy to work well

<works-for person="John Smith" company="Microsoft" title="Vice President">
  <sentence><person>Johnny</person> works at <company>Microsoft</company>.</sentence>
  <sentence>He is a <jobtitle>VP</jobtitle> there.</sentence>
</works-for>


POS tagging

• Annotate each token with part-of-speech data
  – Noun, verb, adjective, ...

• Markov based taggers
  – Treat sequences of tags as a Markov model
  – Use the Viterbi algorithm to find the most probable path
  – Extensions to standard Markov models, e.g., using trigrams

• Transformation based taggers
  – Assign the most probable tags as a starting point
  – Use rewrite rules to modify the tags
  – Learning algorithm to find the best rules and their application order
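For the Markov based approach, the Viterbi step can be sketched as below. This is a plain first-order (bigram) hidden Markov model without the trigram extensions mentioned above, and every probability in the toy model is invented for illustration.

```python
from typing import Dict, List

def viterbi(tokens: List[str], tags: List[str],
            start: Dict[str, float],
            trans: Dict[str, Dict[str, float]],
            emit: Dict[str, Dict[str, float]]) -> List[str]:
    # best[i][t]: probability of the best tag sequence for tokens[:i + 1] ending in t
    best = [{t: start[t] * emit[t].get(tokens[0], 0.0) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * trans[p][t])
            best[i][t] = best[i - 1][prev] * trans[prev][t] * emit[t].get(tokens[i], 0.0)
            back[i][t] = prev
    # Follow the back pointers from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

TAGS = ["DET", "N", "V"]
START = {"DET": 0.6, "N": 0.3, "V": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
    "N": {"DET": 0.1, "N": 0.3, "V": 0.6},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "N": {"dog": 0.5, "barks": 0.1},
    "V": {"barks": 0.5, "dog": 0.05},
}
tagged = viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)
```

With these toy numbers, `barks` is tagged as a verb because the transition from noun to verb outweighs its small noun emission probability.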


Layered pattern matching

• Annotate the text with zero or more layers of meta data
  – The original surface form of the text can be viewed as trivial meta data

• Apply a pattern matcher over selected layers
  – Handcrafted rules or trained models
  – Extract the surface forms that correspond to the matching patterns

• Regular expressions over the surface forms go a long way
  – Modern regular expression engines go beyond regular languages
  – Callouts are our friends! Use them as “oracles”

...
Layer 2: Man ... Food
Layer 1: N/proper  V/past/eat  DET  ADJ  N/singular
Layer 0: Richard  ate  some  bad  curry
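Matching a pattern over one layer and reading the result off another can be sketched with an ordinary regular expression. This is one possible encoding, assumed for the example: the tag layer is serialized as a space-separated string, and character offsets are mapped back to token positions to recover the surface forms.

```python
import re
from typing import List

TOKENS = ["Richard", "ate", "some", "bad", "curry"]                # layer 0
TAGS = ["N/proper", "V/past/eat", "DET", "ADJ", "N/singular"]      # layer 1

def match_layer(tokens: List[str], tags: List[str], pattern: str) -> List[str]:
    # Serialize the tag layer, remembering where each tag starts.
    offsets, position = {}, 0
    for index, tag in enumerate(tags):
        offsets[position] = index
        position += len(tag) + 1
    layer = "".join(tag + " " for tag in tags)
    surface_forms = []
    for match in re.finditer(pattern, layer):
        if match.start() in offsets:  # ignore matches that begin inside a tag
            first = offsets[match.start()]
            length = len(match.group().split())
            surface_forms.append(" ".join(tokens[first:first + length]))
    return surface_forms

# Hypothetical rule: a determiner, any number of adjectives, then a noun.
NOUN_PHRASE = r"DET (?:ADJ )*N/\S+ "
phrases = match_layer(TOKENS, TAGS, NOUN_PHRASE)
```

The pattern is written purely in terms of layer 1, yet the extracted result is the layer 0 text `some bad curry`.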


Sentiment analysis

“What is the tone of this document?”

“Is there a positive spin on current news related to my brand?”

“Thumbs up or thumbs down?”


Sentiment analysis

• Based on work by Turney and Littman
  – Depends on initial word seeds defined as “polar opposites”
  – Search for interesting words and phrases in a training corpus that occur together with the seeds, and score them
  – Use the scored words and phrases as the classification vocabulary

• Interesting generalizations possible
  – Multiple dimensions
  – Dimensions other than thumbs up/down

[Figure: “Threat Assessment” scatter plot. Horizontal axis: Timing, from Distant to Imminent (ticks from -350 to 350). Vertical axis: Scale, from Local to Global (ticks from -200 to 500). Plotted documents: Bin Laden Message; al-Khobar; Communique #1; Hamas Decree; Communique #2; Iran Statement; Nuclear 9/11 Assessment; Port Security; Securing Homeland Plan; Terrorist Nuclear material; Terrorist Obj's; Haifa Street; asteroid; evolution; gun; Terrorist Cells.]
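Seed-based scoring in the spirit of Turney and Littman can be sketched as follows. This is a simplified illustration and not the method used in the product: co-occurrence is taken at the sentence level, and the seed words, the smoothing constant, and the tiny corpus are all assumptions.

```python
import math
from typing import List

POSITIVE = ["good", "excellent"]   # assumed seed words
NEGATIVE = ["bad", "poor"]         # assumed polar opposites

CORPUS = [
    "the service was excellent and the food was awesome",
    "awesome shoes and a good price",
    "bad weather and a dreadful delay",
    "poor service and a dreadful meal",
    "good food",
    "bad food",
]

def orientation(word: str, corpus: List[str], smoothing: float = 0.01) -> float:
    sentences = [set(sentence.lower().split()) for sentence in corpus]

    def hits(*terms: str) -> float:
        # Number of sentences containing all terms, smoothed to avoid log(0).
        return sum(1 for s in sentences if all(t in s for t in terms)) + smoothing

    # Association with the positive seeds minus association with the negative
    # seeds. With equally many seeds on each side, the terms that depend only
    # on the word itself cancel out of the difference.
    score = sum(math.log2(hits(word, seed) / hits(seed)) for seed in POSITIVE)
    score -= sum(math.log2(hits(word, seed) / hits(seed)) for seed in NEGATIVE)
    return score

awesome_score = orientation("awesome", CORPUS)
dreadful_score = orientation("dreadful", CORPUS)
```

Swapping in a different pair of seed sets, for example words for "imminent" versus "distant", gives a score along another dimension, which is the kind of generalization the slide points to.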


Sentiment analysis and XML

• Bring the analysis down to a subdocument level
  – Provides a more accurate sentiment assessment
  – Enables precise searching
  – Enables retrieval of the “evidence”

• Scalar statistics aggregated over matching scopes
  – E.g., average sentiment degree

<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver</sentiment> a profit in Q1.</sentence>

<sentence>The new shoe from <company>Adidas</company> is <sentiment degree="0.9">really awesome</sentiment>!</sentence>
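The average sentiment degree over matching scopes can be computed directly from the two annotated sentences on this slide. The helper function is hypothetical and uses Python's standard XML parser where the real system would use the search engine's aggregation.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

SENTENCES = [
    '<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver'
    '</sentiment> a profit in Q1.</sentence>',
    '<sentence>The new shoe from <company>Adidas</company> is '
    '<sentiment degree="0.9">really awesome</sentiment>!</sentence>',
]

def average_sentiment(sentences: List[str], company: str) -> Optional[float]:
    degrees = []
    for xml in sentences:
        root = ET.fromstring(xml)
        # Only sentences that mention the company contribute to the average.
        if company in [c.text for c in root.iter("company")]:
            degrees.extend(float(s.get("degree")) for s in root.iter("sentiment"))
    return sum(degrees) / len(degrees) if degrees else None

adidas_average = average_sentiment(SENTENCES, "Adidas")
```

The individual sentences remain available as the "evidence" behind the aggregated number.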


Task-oriented interfaces

q, e → xml:sentence:(q and scope(e))


Question answering

• Some classes of questions transform naturally to scope queries

• Statistics to the rescue!

when (is|was|did|...) X? → xml:sentence:(X’ and (scope(date) or scope(time)))

what does X stand for? → xml:acronym:(@base:X and scope(@definition))

who works (at|for) X? → xml:sentence:(company:X and scope(person) and scope(jobtitle))

why (is|are|do|does|have|has|...) X? → xml:sentence:(X’ and (“because” or “due to” or ...))

Examples: when did the berlin wall fall? when is christmas? when was d-day?
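The question-to-query transformations can be sketched as a small table of regular expression rules. This is an illustration only: the rule set covers three of the patterns, the quoting of X is an added assumption, and X is passed through unchanged where the slide applies some transformation to obtain X’.

```python
import re
from typing import List, Optional, Pattern, Tuple

RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^when (?:is|was|did) (?P<x>.+?)\??$"),
     'xml:sentence:("{x}" and (scope(date) or scope(time)))'),
    (re.compile(r"^what does (?P<x>.+?) stand for\??$"),
     'xml:acronym:(@base:"{x}" and scope(@definition))'),
    (re.compile(r"^who works (?:at|for) (?P<x>.+?)\??$"),
     'xml:sentence:(company:"{x}" and scope(person) and scope(jobtitle))'),
]

def rewrite(question: str) -> Optional[str]:
    # Return the scope query for the first matching rule, or None if no rule applies.
    normalized = question.strip().lower()
    for pattern, template in RULES:
        match = pattern.match(normalized)
        if match:
            return template.format(x=match.group("x"))
    return None

query = rewrite("When was d-day?")
```

Questions that match no rule return `None` and can be handled as ordinary keyword queries.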


Audio and video

• Speech-to-text software as an XML producer
  – Speaker identification
  – Scene changes
  – Timing information

• Contextual access to multimedia content

[Figure labels: Contextual navigation; Contextual access]


Scope

Scalable search

Information extraction

XML technologies

“Sentences where someone says something positive about Adidas.”
  xml:sentence:("adidas" and sentiment:@degree:>0)

“Sentences where the acronym ‘MIT’ is defined.”
  xml:sentence:acronym:(@base:"mit" and scope(@definition))

“Dates and locations related to the query ‘d-day’.”
  xml:sentence:("d-day" and (scope(date) or scope(location)))

“Paragraphs that discuss a company merger or acquisition.”
  xml:paragraph:(string("merger", linguistics="on") and scope(company) and scope(price))

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”
  xml:paragraph:quotation:(@speaker:"bush" and scope(price))

“Quotations where somebody says something about the Gaza Strip.”
  xml:quotation:("gaza strip")


Concluding thoughts

• Opportunities for semantically motivated relevancy models
  – Like business rules, almost

• What kind of applications would emerge...
  – ...if the entire web was enriched in this way?
