Contextual Insight in Search Enabling Technologies and Applications Aleksander Øhrn, PhD August 31,...

Preview:

Citation preview

Contextual Insight in SearchEnabling Technologies and ApplicationsAleksander Øhrn, PhDAugust 31, 2005

Outline

• Background and scope

• Enabling technologies– Scalable search and data aggregation– XML search– Information extraction

• Example applications

Background

• Fast Search & Transfer ASA– Founded 1997, grew out of research at NTNU– Enterprise search market leader– Previously a major web search player with alltheweb.com

• Me– PhD in computer science from NTNU (2000)– Chief scientist at Fast Search & Transfer ASA– Main focus on the “intelligence layer” of search

Scope

Scalable search

Information extractionXML technologies

“Sentences where someone says something positive about Adidas.”

“Sentences where the acronym ‘MIT’ is defined.”

“Dates and locations related to the query ‘d-day’.”

“Paragraphs that discuss a company merger or acquisition.”

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”

“Quotations where somebody says something about the Gaza Strip.”

Search engines and databases

• Serve similar purposes– Organize your data so that it can be queried

• Hybrid deployments– Database offloading

Search engines Relational databases

Data types Originally developed with unstructured textual data in mind, are becoming increasingly good also at dealing with structured data

Originally developed with structured data in mind, are becoming increasingly good also at dealing with unstructured textual data

Loss tolerance Typically optimized for finding and returning only “enough”, “good” results

Typically optimized for finding all results

Performance focus Getting results back fast Depends

Ranking Natural Sorting by some field

Advanced linguistics Lots Little

Transactions Not really Yes

Joins Not really Yes

Anatomy of a search platform

CO

NT

EN

T A

PI

QU

ER

Y A

PI

MANAGEMENT & APPLICATION SERVICES

SECURITY ACCESSACL Monitor User Monitor

>

Portlets

Applications

SFE

Custom

Web

File

QUERY CONNECTORS

CONTENT CONNECTORS

Database

Application

Custom

Deployment Business Application Administration

TOOLS & TOOL BUILDING FRAMEWORK

Custom

• Processes the content before it gets indexed– Documents flow through a pipeline of processing stages– Highly customizable

• Example processing stages– Format, language and encoding detection– Format and encoding normalization– HTML parsing– Entity extraction– Vectorization– Categorization– Lemmatization and synonym handling– Anchor text harvesting– ...

Document processing

venus williams; arthur ashe; ...

{(venus williams, 1.0), (wimbledon, 0.81), (center court, 0.65), ...}

sports/tennis

mouse ~ mice; car ~ automobile

pdf → html; iso-8859-1 → utf-8

pdf; english; iso-8859-1

Document Processing

“UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.”

Query processing

• Processes the query before it gets sent to the search engine– Queries flow through a pipeline of processing stages– Highly customizable

• Example processing stages– Stopword handling– Phrasing and antiphrasing– Spellchecking– Natural language handling– Lemmatization and synonym handling– Ontologies– ...

mouse ~ mice; car ~ automobile

xsara < citröen < car < vehicle

brittany speers → britney spears

television under 200 dollars → television AND price:<200

red cross → “red cross”; where can i find information about cars → cars

Engine architecture

• Scaling in data volume– Add columns– Each column holds a

partition of the data– Query the partitions in

parallel

• Scaling in query traffic– Add rows– Replicate the partitions– Distribute the queries

• Scaling in other dimensions– Query complexity– Fault tolerance

Engine architecture

qrserver nodes

dispatch nodes

search nodes(single tier)

indexing nodes

• Query and result processing– Turn bad queries into good ones– Help navigate among matching

documents

• Divide and conquer– Parallelize the searching over disjoint

sets of documents– Merge the results

Engine architecture

• For very big deployment scenarios

– Web scale, i.e., billions of documents

• All search nodes are equal, but some are more equal than others

– Organize the search nodes into multiple tiers– Top tier nodes may have fewer documents and run on better hardware– Keep the good stuff in the top tiers– Only fall through to the lower tiers if not enough good matches are not found in the top tiers– Use query logs to decide which documents that belong in which tiers

Tier 1

Tier 2

Tier 3

Fallthrough?

Fallthrough?

Data aggregation

• Computes histograms and scalar statistics across the result set– Counting is done distributed across the search nodes– For numerical or temporal fields, this may involve applying a real-time

discretization algorithm– The histograms can be scored according to their ”interestingness”

• Histograms may serve as query refinement sources– The counts preview how many results to expect– Adds a discovery aspect

Data aggregation

• Trends and correlations– Query specific

• Ad hoc reports and OLAP– Analytical applications

powered by search

Research Topics

stress echocardiographyimage orientationregurgitant orificeabnormal relaxation

two-dimensional echocardiographyventricular response in patients

initial repairmitral lesions

echocardiographic contrastmyocardial infarction

Co-Authors

A Tajik 56J Oh 25P Pellikka 16B Khanderia 1

6D Hagler 13V Roger 13K Bailey 13F Miller 11

Mayo

Cli

nic

Pro

c

J A

m S

oc E

ch

ocard

iog

rap

hy

J A

m C

oll

Card

iolo

gy

Cir

cu

lati

on

Am

J C

ard

iolo

gy

Ech

ocard

iog

rap

hy

Am

Heart

J

Dobutamine 1 2 5 4 1Cardiotonic agents 1 3 2 1Contrast media 1 1Atropine 1Imagent US 1

Chemical substances

Jou

rnals

Documents authoredby J Seward

21

Result processing

• Unsupervised clustering– Analyzes on-the-fly similarities across matching documents– Group together documents having similar content– Provides a bird’s-eye view of topical spread

• Query refinement suggestions– Examines the distribution of meta data– Builds a histogram of values– Provides a means for slicing and dicing the result set

• Filtering– E.g., dynamic duplicate detection

Relevancy

RelevancyprimitivesRelevancyprimitives

CompletenessAuthorityStatisticsQualityFreshness

CASQF

C SA FQ

Rankscore

Completeness

How well does the query match superior contexts?

Authority Is the document considered an authority for this query?

Statistics How well does the content of this document match the query?

Quality Is this a document of ”high quality”?

Freshness How old is this document, when was it last updated?

Geography Where are you querying from?

XML search

• Moving beyond a flat document model– From a simple (field, value) layout to complex,

nested structure having scopes/tags– From a predefined index layout to schema

flexibility

• Some queries cannot be adequately handledwithout structure– Flattening out the content won’t quite work– False positives slip through

<authors> <author>John P. Brown</author> <author>George Smith</author></authors>“Show me documents authored by John Smith”

XML search

• Queries that combine structure and content– Impose contextual constraints on the content

• FAST Query Language (FQL)– Partial overlap with XQuery– Linguistic extensions

• Return matching scopes– See the matching document

fragments, including markupand annotations

<matches> <match> <sentence>The publicist of <person>Robert De Niro</person> announces that the actor has prostate cancer.</sentence> </match> <match> <sentence><person>De Niro</person> was diagnosed with cancer last week.</sentence> </match> <match> <sentence>”He’ll fight the cancer,” says <person>John Barnes</person>, founder of his Welsh fanclub. </match></matches>

xml:sentence:(“cancer” and scope(person))

//sentence[fast:matches(., “cancer”) and .//person]

Data aggregation and XML

• Constrain the aggregation to the level of matching scopes– Document-level aggregation may yield too imprecise results– Provides a more factual and relevant relation between the query and

the histogram entries– Example from Wikipedia

• Or to some other constrained scope– E.g., match documents that within the same sentence contains a

company the word ’scandal’, but aggregate over all person names that occur in sentences that contain the job title ’cfo’.

Persons that appear in documents that contain the word ‘soccer’

Persons that appear in paragraphs that contain the word ‘soccer’

Data aggregation and XML

• Aggregate not only over content, but also over structure!

• Discovery, schema exploration and disambiguation

xml:sentence:(“clinton” and scope(date))

<sentence>So, whether you are visiting for the <date base=“XXXX-07-04”>4th of July</date> weekend or an extended stay, we are confident that you will enjoy your time in <location type=“city”>Clinton</location>!</sentence>

<sentence><person>Hillary Rodham Clinton</person> was elected <location type=“country”>United States</location> Senator from <location type=“city”>New York</location> on <date base=“2000-11-07”>November 7, 2000</date>.</sentence><sentence><person>Chelsea Clinton</person> has been dating <person>Jeremy Kane</person>, a <university>Stanford University</university> classmate since <date base=“2000-09-08”>Sept. 8, 2000</date>.</sentence>

person (3)date (3)location (2)university (1)

person (2)location (1)

“Which scopes exist in my matching document fragments? “

“Inside which scopes are my query terms found?”

Information extraction

• Apply text mining techniques to identify entities of interest– Structural and semantic regions– Makes unstructured data more structured

• Mark them up in context– Grammars as scope producers– Scopes can be annotated with meta data

• Make it possible to act on the information!– E.g., make it searchable in a way that preserves context

...in fair and free elections.</sentence> <sentence><person base=“Leonid Kuchma” title=“president” >Kuchma</person> was reelected in <date base=“1999-11-XX”>November 1999</date> to another five-year term, with 56% of the vote.</sentence> <sentence>International observers criticized...

...in fair and free elections. Kuchma was reelected in November 1999 to another five-year term, with 56% of the vote. International observers criticized...

Searching with context

• High search precision and expressiveness– Semantically logical boundaries– Disambiguation– Wildcards at the semantic level– Work with normalized forms– Where there’s smoke, there’s fire– ...

• Navigation– Suggestions to broaden/refine – Discover, summarize and disambiguate

E.g., ‘bush’ as in ‘George Bush’ and not ‘burning bush’

Predicates that test for scope existence

E.g., use sentence or paragraph as the main enclosing scope

2001-09-11, Sept. 11, 11th of September, ...

E.g., use existence of a zip code to find an address

sentence → paragraph; paragraph → sentence

’clinton’ as in ’person’ or in ’location’?

Complex relations

• Relations are often of interest– E.g., (gene x disease) or (person x company x jobtitle) or (company x product) – Possible to model explicitly, if we are willing to invest the effort

• Can we avoid upfront modelling?– We want to enable ad hoc fact finding and relations– We cannot foresee everything– Even if we did, we may not want to spend the time modelling all the targeted combinations– We want to maximize document processing throughput

• Use a scoped search to relate entities– Require the entity types involved to co-occur within the same semantically meaningful scope– Filter and count over the matching scopes

• Statistics to the rescue?– Co-occurrence is no guarantee for relatedness– Implies a certain amount of redundancy to work well

<works-for person=“John Smith” company=“Microsoft” title=“Vice President”> <sentence><person>Johnny</person> works at <company>Microsoft</company>.</sentence> <sentence>He is a <jobtitle>VP</jobtitle> there.</sentence></works-for>

POS tagging

• Annotate each token with part-of-speech data– Noun, verb, adjective, ...

• Markov based taggers– Treat sequences of tags as a Markov model– Use the Viterbi algorithm to find the most probable path– Extensions to standard Markov models, e.g., using trigrams

• Transformation based taggers– Assign the most probable tags as a starting point– Use rewrite rules to modify the tags– Learning algorithm to find the best rules and their application order

POS tagging

• Annotate the text with zero or more layers of meta data– The original surface form of the text can be viewed as trivial meta data

• Apply a pattern matcher over selected layers– Handcrafted rules or trained models– Extract the surface forms that correspond to the matching patterns

• Regular expressions over the surface forms go a long way– Modern regular expression engines go beyond regular languages– Callouts are our friends! Use them as “oracles”

Layered pattern matching

... ... ... ... ... ...

Layer 2 Man Food

Layer 1 N/proper V/past/eat

DET ADJ N/singular

Layer 0 Richard ate some bad curry

Sentiment analysis

“What is the tone of this document?”

“Is there a positive spin on current news related to my brand?”

“Thumbs up or thumbs down?”

Sentiment analysis

• Based on work by Turney and Littman– Depends on initial word seeds defined as “polar opposites”– Search for interesting words and phrases in a training corpus that

occur together with the seeds, and score them– Use the scored words and phrases as the classification vocabulary

• Interesting generalizations possible– Multiple dimensions– Dimensions other than thumbs up/down

Threat Assessment

Bin Laden Message

al-Khobar

Communique #1

Hamas Decree

Communique #2

Iran Statement

Nuclear 9/11 Assessment

Port Security

Securing Homeland Plan

Terrorist Nuclear material

Terrorist Obj's

Haifa Street

asteroid

evolution

gun

Terrorist Cells

-200

-100

0

100

200

300

400

500

-350 -250 -150 -50 50 150 250 350

Timing

Sc

ale

Lo

cal

Lo

cal

Glo

bal

Glo

bal

DistantDistant ImminentImminent

Sentiment analysis and XML

• Bring the analysis down to a subdocument level– Provides a more accurate sentiment assessment– Enables precise searching– Enables retrieval of the “evidence”

• Scalar statistics aggregated over matching scopes– E.g, average sentiment degree

<sentence><company>Adidas</company> <sentiment degree="-0.6">failed to deliver</sentiment> a profit in Q1.</sentence>

<sentence>The new shoe from <company>Adidas</company> is <sentiment degree="0.9">really awesome</sentiment>!</sentence>

Task-oriented interfaces

q, e → xml:sentence:(q and scope(e))

Question answering

• Some classes of questions transform naturally to scope queries

• Statistics to the rescue!

when (is|was|did|...) X? → xml:sentence:(X’ and (scope(date) or scope(time)))

what does X stand for? →xml:acronym:(@base:X and scope(@definition))

who works (at|for) X? →xml:sentence:(company:X and scope(person) and

scope(jobtitle))

why (is|are|do|does|have|has|...) X? →xml:sentence:(X’ and (“because” or “due to” or ...))

when did the berlin wall fall? when is christmas? when was d-day?

Audio and video

• Speech-to-text software as an XML producer– Speaker identification– Scene changes– Timing information

• Contextual access to multimedia content

Contextual navigation Contextual

access

Scope

Scalable search

Information extractionXML technologies

“Sentences where someone says something positive about Adidas.”

“Sentences where the acronym ‘MIT’ is defined.”

“Dates and locations related to the query ‘d-day’.”

xml:quotation:(“gaza strip”)

xml:sentence:(“adidas” and sentiment:@degree:>0)

“Paragraphs that discuss a company merger or acquisition.”

“Paragraphs that contain quotations by George W. Bush, where he mentions a monetary amount.”

“Quotations where somebody says something about the Gaza Strip.”

xml:sentence:acronym:(@base:“mit” and scope(@definition))

xml:sentence:(“d-day” and (scope(date) or scope(location))

xml:paragraph:(string(“merger”, linguistics=“on”)

and scope(company) and scope(price))

xml:paragraph:quotation:(@speaker:”bush” and scope(price)))

Concluding thoughts

• Opportunities for semantically motivated relevancy models– Like business rules, almost

• What kind of applications would emerge...– ...if the entire web was enriched in this way?

Recommended