
Page 1

Scalable Information Extraction

Eugene Agichtein

Page 2

Example: Angina treatments

PDR

Web search results

Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)

Medical reference and literature

MedLine

guideline for unstable angina

unstable angina management

herbal treatment for angina pain

medications for treating angina

alternative treatment for angina pain

treatment for angina

angina treatments

Page 3

Research Goal

Accurate, intuitive, and efficient access to knowledge in unstructured sources.

Approaches:
- Information Retrieval: retrieve the relevant documents or passages; question answering
- Human Reading: construct domain-specific "verticals" (MedLine)
- Machine Reading: extract entities and relationships; a network of relationships: the Semantic Web

Page 4

Semantic Relationships “Buried” in Unstructured Text

Web, newsgroups, web logs; text databases (PubMed, CiteSeer, etc.); newspaper archives

Corporate mergers, succession, location; terrorist attacks [Message Understanding Conferences]

…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…

RecommendedTreatment relation:

Drug | Condition
statins | recurrent myocardial infarction
statins | strokes
statins | unstable angina pectoris

Page 5

What Structured Representation Can Do for You:

… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for Semantic Web

Large Text Collection → Structured Relation

Page 6

Challenges in Information Extraction

Portability: reduce the effort to tune for new domains and tasks (MUC systems: experts would take 8-12 weeks to tune)

Scalability, Efficiency, Access: enable information extraction over large collections (1 sec / document * 5 billion docs = 158 CPU years)

Approach: learn from data ("Bootstrapping")
- Snowball: Partially Supervised Information Extraction
- Querying Large Text Databases for Efficient Information Extraction

Page 7

Outline
- Snowball: partially supervised information extraction (overview and key results)
- Effective retrieval algorithms for information extraction (in detail)
- Current: mining user behavior for web search
- Future work

Page 8

The Snowball System: Overview

Text Database → Snowball →

Organization | Location | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
157th Street | Manhattan | 0.52
15th Party Congress | China | 0.3
15th Century Europe | Dark Ages | 0.1


Page 9

Snowball: Getting User Input

User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.

The Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples

ACM DL 2000

Organization | Headquarters
Microsoft | Redmond
IBM | Armonk
Intel | Santa Clara

Page 10

Snowball: Finding Example Occurrences

Can use any full-text search engine.

Search Engine

Text Database

Organization | Headquarters
Microsoft | Redmond
IBM | Armonk
Intel | Santa Clara

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp

The Armonk-based IBM introduced a new line…

Change of guard at IBM Corporation’s headquarters near Armonk, NY ...

Page 11

Snowball: Tagging Entities

Named entity taggers can recognize Dates, People, Locations, Organizations, … (MITRE's Alembic, IBM's Talent, LingPipe, …)

Computer servers at Microsoft ’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp

The Armonk -based IBM introduced a new line…

Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...

Page 12

Snowball: Extraction Patterns

General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations:
- String Match (accepts the string "'s headquarters in")
- Vector-Space (~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
- Classifier (estimate P(T = valid | 's, headquarters, in))

Computer servers at Microsoft’s headquarters in Redmond…

Page 13

Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms:

ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION

2. Cluster similar occurrences.

Page 14

Snowball: Generating Patterns

3. Create patterns as filtered cluster centroids:

ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION

Page 15

Google 's new headquarters in Mountain View are …

Snowball: Extracting New Tuples

Match tagged text fragments against the patterns.

Candidate occurrence from the sentence above:
V: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

Patterns:
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION (Match = 0.8)
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION (Match = 0.4)
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION (Match = 0)
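To make the matching model concrete, here is a minimal Python sketch (not the original Snowball code) of cosine matching between a tagged occurrence and a pattern, using the illustrative weights from this slide; real Snowball also weighs the left and right contexts, which are omitted here, so the score differs slightly from the slide's 0.8.

import math

def cosine(v1: dict, v2: dict) -> float:
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def match(occurrence: dict, pattern: dict) -> float:
    # Entity tags must line up; the score is the middle-context similarity.
    if occurrence["tags"] != pattern["tags"]:
        return 0.0
    return cosine(occurrence["middle"], pattern["middle"])

V = {"tags": ("ORGANIZATION", "LOCATION"),
     "middle": {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}}
P1 = {"tags": ("ORGANIZATION", "LOCATION"),
      "middle": {"'s": 0.71, "headquarters": 0.71}}
print(round(match(V, P1), 2))  # 0.71 on middle contexts alone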

Page 16

Snowball: Evaluating Patterns

Automatically estimate pattern confidence:
Conf(P4) = Positive / Total = 2/3 = 0.66


P4: ORGANIZATION {<"," 1>} LOCATION

- IBM, Armonk, reported… → Positive
- Intel, Santa Clara, introduced... → Positive
- "Bet on Microsoft", New York-based analyst Jane Smith said... → Negative

Current seed tuples:
Organization | Headquarters
IBM | Armonk
Intel | Santa Clara
Microsoft | Redmond

Page 17

Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

Conf(T) = 1 − ∏_i ( 1 − Conf(P_i) · Match(P_i) )

A tuple has high confidence if it was generated by high-confidence patterns.


Example: T = <3Com, Santa Clara>
- extracted by P4: ORGANIZATION {<"," 1>} LOCATION, with Conf(P4) = 0.66 and Match = 0.4
- extracted by P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION, with Conf(P3) = 0.95 and Match = 0.8

Conf(T) = 1 − (1 − 0.66 · 0.4) · (1 − 0.95 · 0.8) ≈ 0.83
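The same computation as a short sketch; the numbers are the slide's, and the result matches the slide's 0.83 up to rounding of Conf(P4) = 2/3.

def tuple_confidence(evidence):
    # evidence: list of (pattern confidence, match score) pairs for one tuple.
    remaining_doubt = 1.0
    for p_conf, match in evidence:
        remaining_doubt *= 1.0 - p_conf * match
    return 1.0 - remaining_doubt

# <3Com, Santa Clara>: P4 (conf 0.66, match 0.4) and P3 (conf 0.95, match 0.8)
print(round(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]), 2))  # 0.82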

Page 18

Snowball: Evaluating Tuples

Organization | Headquarters | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
157th Street | Manhattan | 0.52
15th Party Congress | China | 0.3
15th Century Europe | Dark Ages | 0.1
…

Keep only high-confidence tuples for the next iteration.

Page 19

Snowball: Evaluating Tuples

Organization | Headquarters | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7

Start a new iteration with the expanded example set. Iterate until no new tuples are extracted.

Page 20

Pattern-Tuple Duality

A "good" tuple:
- is extracted by "good" patterns
- tuple weight ~ goodness

A "good" pattern:
- is generated by "good" tuples
- extracts "good" new tuples
- pattern weight ~ goodness

Edge weight: match/similarity of the tuple context to the pattern

Page 21

How to Set Node Weights

Constraint violation (from before):
Conf(P) = Log(Pos) · Pos / (Pos + Neg)
Conf(T) = 1 − ∏_i ( 1 − Conf(P_i) · Match(P_i) )

HITS [Hassan et al., EMNLP 2006]:
Conf(P) = ∑ Conf(T); Conf(T) = ∑ Conf(P)

URNS [Downey et al., IJCAI 2005]

EM-Spy [Agichtein, SDM 2006]:
Unknown tuples = Neg; compute Conf(P), Conf(T); iterate
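A hedged sketch of the HITS-style option above: pattern confidence as the sum of the confidences of the tuples it extracts, and vice versa. The max-normalization that keeps the scores bounded is an added assumption, not from the slide.

def hits_confidence(edges, n_iter=50):
    # edges: list of (pattern, tuple) pairs meaning "pattern extracted tuple".
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        conf_t = {t: sum(conf_p[p] for p, t2 in edges if t2 == t) for t in tuples_}
        conf_p = {p: sum(conf_t[t] for p2, t in edges if p2 == p) for p in patterns}
        # Normalize so the mutual reinforcement converges.
        zt, zp = max(conf_t.values()), max(conf_p.values())
        conf_t = {t: v / zt for t, v in conf_t.items()}
        conf_p = {p: v / zp for p, v in conf_p.items()}
    return conf_p, conf_t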

Page 22

Snowball: EM-based Pattern Evaluation

Page 23

Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm:
1. "Hide" the labels for some seed tuples ("spies").
2. Iterate the EM algorithm to convergence on tuple/pattern confidence values.
3. Set a threshold t relative to the 90% mark of the spy tuples' scores.
4. Re-initialize Snowball using the new seed tuples.

Organization | Headquarters | Initial | Final
Microsoft | Redmond | 1 | 1
IBM | Armonk | 1 | 0.8
Intel | Santa Clara | 1 | 0.9
AG Edwards | St Louis | 0 | 0.9
Air Canada | Montreal | 0 | 0.8
7th Level | Richardson | 0 | 0.8
3Com Corp | Santa Clara | 0 | 0.8
3DO | Redwood City | 0 | 0.7
3M | Minneapolis | 0 | 0.7
MacWorld | San Francisco | 0 | 0.7
157th Street | Manhattan | 0 | 0.52
15th Party Congress | China | 0 | 0.3
15th Century Europe | Dark Ages | 0 | 0.1
…
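A minimal sketch of the spy-threshold step, under one reading of step 3 (pick t so that roughly 90% of the hidden spy tuples score at or above it); the EM iterations themselves are omitted and the function is illustrative.

import math

def spy_threshold(confidences, spies, coverage=0.9):
    # confidences: {tuple: post-EM confidence}; spies: seed tuples whose
    # labels were hidden during EM.
    spy_scores = sorted((confidences[s] for s in spies), reverse=True)
    k = max(0, math.ceil(coverage * len(spy_scores)) - 1)
    return spy_scores[k]

# Tuples scoring above the threshold become the seeds for re-initializing Snowball.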

Page 24

Adapting Snowball for New Relations

Large parameter space:
- initial seed tuples (randomly chosen, multiple runs)
- acceptor features: words, stems, n-grams, phrases, punctuation, POS
- feature selection techniques: OR, NB, Freq, "support", combinations
- feature weights: TF*IDF, TF, TF*NB, NB
- pattern evaluation strategies: NN, constraint violation, EM, EM-Spy

Automatically estimate parameter values:
- estimate operating parameters based on occurrences of the seed tuples
- run cross-validation on hold-out sets of seed tuples for optimal performance
- discard seed occurrences that do not have close "neighbors"

Page 25

Example Task 1: DiseaseOutbreaks

Proteus: 0.409; Snowball: 0.415

SDM 2006

Page 26

Example Task 2: Bioinformatics, a.k.a. mining the "bibliome"

100,000+ gene and protein synonyms extracted from 50,000+ journal articles

Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)

ISMB 2003

"APO-1, also known as DR6…"
"MEK4, also called SEK1…"

Page 27

Snowball Used in Various Domains

News: NYT, WSJ, AP [DL'00, SDM'06]
- CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDRHealth, Micromedex, … [Thesis]
- AdverseEffects, DrugInteractions, RecommendedTreatments

Biological literature: GeneWays corpus [ISMB'03]
- Gene and Protein Synonyms

Page 28

Limits of Bootstrapping for Extraction

The task is "easy" when context term distributions diverge from the background.
Quantify this as relative entropy (Kullback-Leibler divergence).
After calibration, the metric predicts whether bootstrapping is likely to work.

KL(LM_C || LM_BG) = ∑_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )

[Chart: frequencies of context terms ("the", "to", "and", "said", "'s", "company", "mrs", "won", "president") under the context vs. background language models]

CIKM 2005

President George W Bush’s three-day visit to India
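The relative-entropy metric above in a few lines, assuming both language models are dictionaries of smoothed word probabilities over the same vocabulary:

import math

def kl_divergence(lm_c: dict, lm_bg: dict) -> float:
    # KL(LM_C || LM_BG) = sum over w of LM_C(w) * log(LM_C(w) / LM_BG(w)).
    # Smoothing must guarantee LM_BG(w) > 0 wherever LM_C(w) > 0.
    return sum(p * math.log(p / lm_bg[w]) for w, p in lm_c.items() if p > 0)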

Page 29

Few Relations Cover Common Questions

25 relations cover > 50% of question types; 5 relations cover > 55% of question instances.

SIGIR 2005

Relation | Type (%) | Instance (%)
<person> discovers <concept> | 7.7 | 2.9
<person> has position <concept> | 5.6 | 4.6
<location> has location <location> | 5.2 | 1.5
<person> known for <concept> | 4.7 | 1.7
<event> has date <date> | 4.1 | 0.9

Page 30

Outline
- Snowball, a domain-independent, partially supervised information extraction system
- Retrieval algorithms for scalable information extraction
- Current: mining user behavior for web search
- Future work

Page 31

Extracting A Relation From a Large Text Database

Brute-force approach: feed all docs to the information extraction system. Expensive for large collections:
- only a tiny fraction of the documents are often useful
- many databases are not crawlable
- often a search interface is available, with an existing keyword index

How to identify the "useful" documents?

Text Database → Information Extraction System → Structured Relation

Page 32

Accessing Text DBs via Search Engines

Text Database → Search Engine → Information Extraction System → Structured Relation

Search engines impose limitations:
- limit on documents retrieved per query
- support simple keywords and phrases
- ignore "stopwords" (e.g., "a", "is")

Page 33

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

Problem: learn keyword queries to retrieve "promising" documents.

User-Provided Seed Tuples → Query Generation → Queries → Search Engine → Promising Documents → Information Extraction System → Extracted Relation

Seed tuples:
DiseaseName | Location | Date
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995

Extracted relation:
DiseaseName | Location | Date
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995
Mad Cow Disease | The U.K. | July 1995
Pneumonia | The U.S. | Feb. 1995

Page 34

Learning Queries to Retrieve Promising Documents

1. Get a document sample with "likely negative" and "likely positive" examples.
2. Label the sample documents using the information extraction system as "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.

[Diagram: user-provided seed tuples drive seed sampling against the search engine and text database; the extraction system labels the sampled documents (+/−) for classifier training and query generation]

Page 35

Training Classifiers to Recognize “Useful” Documents

Document features: words

D1 (+): disease, reported, epidemic, expected, area
D2 (+): virus, reported, expected, infected, patients
D3 (−): products, made, used, exported, far
D4 (−): past, old, homerun, sponsored, event

Ripper: disease AND reported => USEFUL
SVM: virus 3, infected 2, sponsored −1
Okapi (IR): disease, infected, reported, virus, epidemic, products, used, far, exported

Page 36

Generating Queries from Classifiers

Ripper: disease AND reported => USEFUL → query [disease AND reported]
SVM: virus 3, infected 2, sponsored −1 → queries [virus], [infected]
Okapi (IR): disease, infected, reported, virus, epidemic, … → queries [epidemic], [virus]
QCombined: [disease AND reported], [epidemic], [virus]
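A hedged sketch of steps 3 and 4, in the spirit of the SVM column above: train a linear classifier on the labeled sample and read queries off its largest positive weights. This uses scikit-learn on the toy documents from the previous slide; it is illustrative, not the original QXtract implementation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["disease reported epidemic expected area",    # useful (+)
        "virus reported expected infected patients",   # useful (+)
        "products made used exported far",             # not useful (-)
        "past old homerun sponsored event"]            # not useful (-)
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Words with the largest positive weights become single-keyword queries.
ranked = sorted(zip(clf.coef_[0], vec.get_feature_names_out()), reverse=True)
queries = [word for weight, word in ranked[:3] if weight > 0]
print(queries)  # the top positively weighted words on this toy sample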

Page 37

SIGMOD 2003 Demonstration

Page 38

Tuples: A Simple Querying Strategy

1. Convert the given tuples into queries
2. Retrieve the matching documents
3. Extract new tuples from the documents and iterate

Example: <Ebola, Zaire, May 1995> → query ["Ebola" and "Zaire"] → search engine → information extraction system → new tuples <Malaria, Ethiopia, Jan. 1995>, <hemorrhagic fever, Africa, May 1995>

Page 39

Comparison of Document Access Methods

[Chart: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline]

QXtract: 60% of the relation extracted from 10% of the documents of a 135,000-document newspaper article database.
Tuples strategy: recall at most 46%.

Page 40

How to choose the best strategy?

- Tuples: simple, no training, but limited recall
- QXtract: robust, but has training and query overhead
- Scan: no overhead, but must process all documents

Page 41

Predicting Recall of Tuples Strategy

[Diagram: starting from a seed tuple, querying either reaches most of the relation (SUCCESS!) or stalls (FAILURE)]

Can we predict if Tuples will succeed?

WebDB 2003

Page 42

Abstract the Problem: Querying Graph

[Bipartite graph between tuples t1-t5 and documents d1-d5: an edge t → d means that querying with t (e.g., ["Ebola" and "Zaire"]) retrieves d; an edge d → t means that d contains t]

Note: only the top K docs are returned for each query. <Violence, U.S.> retrieves many documents that do not contain tuples; searching for an extracted tuple may not retrieve its source document.

Page 43

Information Reachability Graph

t1 retrieves document d1, which contains t2; so t2 (and, transitively, t3 and t4) are "reachable" from t1.

[Left: the tuple-document querying graph; right: the induced reachability graph over tuples t1-t5]

Page 44

Connected Components

- In: tuples that retrieve other tuples but are not themselves reachable
- Core (strongly connected): tuples that retrieve other tuples and themselves
- Out: reachable tuples that do not retrieve tuples in the Core

Page 45

Sizes of Connected Components

How many tuples are in the largest Core + Out?

Conjecture: the degree distribution in reachability graphs follows a "power law." Then the reachability graph has at most one giant component.

Define Reachability as the fraction of tuples in the largest Core + Out.
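A toy sketch of computing that fraction with networkx (my choice of tooling, not the paper's): take the largest strongly connected component as the Core and everything reachable from it as the Out set.

import networkx as nx

# Edge t -> t' means: querying with t retrieves a document containing t'.
G = nx.DiGraph([("t1", "t2"), ("t2", "t1"),   # Core: t1 <-> t2
                ("t1", "t3"), ("t3", "t4")])  # Out: t3, t4
G.add_node("t5")                              # an unreachable tuple

core = max(nx.strongly_connected_components(G), key=len)
out = set().union(*(nx.descendants(G, n) for n in core))
reachability = len(core | out) / G.number_of_nodes()
print(reachability)  # 4 of 5 tuples are in Core + Out -> 0.8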

Page 46

NYT Reachability Graph: Outdegree Distribution

[Log-log plots of the outdegree distribution for MaxResults=10 and MaxResults=50]

Matches the power-law distribution.

Page 47

NYT: Component Size Distribution

MaxResults=10: CG / |T| = 0.297 (not "reachable")
MaxResults=50: CG / |T| = 0.620 ("reachable")

Page 48

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

Page 49

Estimating Reachability

In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.

Estimate: Reachability ~ CG / |T|, which depends only on d (the average outdegree).

* For power-law exponent < 3.457; Chung and Lu, Annals of Combinatorics, 2002.

Page 50

Estimating Reachability Algorithm

1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute reachability graph edges.
4. Estimate the average outdegree.
5. Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002.

[Example: sampled tuples t1-t4 and retrieved documents d1-d4; the extracted edges give average outdegree d = 1.5]
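The algorithm as a sketch; search (the keyword interface) and extract (the IE system) are stand-ins for components not shown here.

import random

def estimate_avg_outdegree(known_tuples, search, extract, sample_size=10):
    sample = random.sample(list(known_tuples), sample_size)  # step 1
    edges = 0
    for t in sample:
        for doc in search(t):              # step 2: query the database
            edges += len(extract(doc))     # step 3: each extracted tuple is an edge t -> t'
    return edges / sample_size             # step 4: average outdegree d

# Step 5: plug d into the Chung-Lu result to estimate reachability.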

Page 51

Estimating Reachability of NYT

[Chart: estimated reachability vs. MaxResults (MR = 1, 10, 50, 100, 200, 1000) for sample sizes S = 10, 50, 100, 200, against the real graph (reachability ≈ 0.46 at MR = 10)]

Approximate reachability is estimated after ~50 queries. This can be used to predict the success (or failure) of a Tuples querying strategy.

Page 52

To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

Information extraction applications extract structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire

Information Extraction System

(e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Page 53

An Abstract View of Text-Centric Tasks

Text Database → (1. retrieve documents from database) → Extraction System → (2. process documents; 3. extract output tuples) → Output tuples

Task | "Tuple"
Information Extraction | Relation Tuple
Database Selection | Word (+Frequency)
Focused Crawling | Web Page about a Topic

For the rest of the talk

[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

Page 54

Executing a Text-Centric Task

1. Retrieve documents from the database; 2. process documents with the extraction system; 3. extract output tuples.

Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results

Unlike the relational world:
- indexes are only "approximate": the index is on keywords, not on the tuples of interest
- the choice of execution plan affects output completeness (not only speed)

→ The underlying data distribution dictates what is best.

Page 55

Execution Plan Characteristics

Execution plans have two main characteristics: execution time and recall (the fraction of tuples retrieved).

Question: how do we choose the fastest execution plan for reaching a target recall?

"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

Page 56

Outline
- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions

Page 57

Scan

Scan retrieves and processes documents sequentially, until reaching the target recall:

Execution time = |Retrieved Docs| · (R + P)

where R is the time to retrieve a document and P is the time to process it.

Question: how many documents does Scan retrieve to reach the target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in paper).

Page 58

Estimating Recall of Scan

Modeling Scan for tuple t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents? A "sampling without replacement" process.

After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution.

The recall for tuple t is the probability that the frequency of t in the S documents is greater than 0.

[Diagram: sampling S of the N documents in D for a tuple t such as <SARS, China>; g(t) = frequency of tuple t]

Page 59

Estimating Recall of Scan

Modeling Scan overall: multiple "sampling without replacement" processes, one for each tuple. Overall recall is the average recall across tuples.

→ We can compute the number of documents required to reach the target recall.

[Diagram: parallel sampling processes for tuples t1 (<SARS, China>), t2 (<Ebola, Zaire>), …, tM over documents d1…dN]

Execution time = |Retrieved Docs| · (R + P)
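Both recall computations fit in a few lines with scipy's hypergeometric distribution, a direct sketch of the model above:

from scipy.stats import hypergeom

def tuple_recall(N: int, g: int, S: int) -> float:
    # P(tuple t, appearing in g of the N docs, is seen in a scan of S docs).
    return 1.0 - hypergeom(M=N, n=g, N=S).pmf(0)

def overall_recall(N: int, frequencies: list, S: int) -> float:
    # Average the per-tuple recall across all tuples.
    return sum(tuple_recall(N, g, S) for g in frequencies) / len(frequencies)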

Page 60

Iterative Set Expansion

1. Query the database with seed tuples (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tuples from the docs (e.g., <Malaria, Ethiopia>)
4. Augment the seed tuples with the new tuples, and repeat

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.

Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?

Page 61

Using Querying Graph for Analysis

We need to compute:
- the number of documents retrieved after sending Q tuples as queries (estimates time)
- the number of tuples that appear in the retrieved documents (estimates recall)

To estimate these we need:
- the degree distribution of the tuples discovered by retrieving documents
- the degree distribution of the documents retrieved by the tuples

(Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)

[Bipartite querying graph: tuples <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam> linked to documents d1-d5]

Page 62

Summary of Cost Analysis

Our analysis so far:
- takes as input a target recall
- gives as output the time for each plan to reach the target recall (time = infinity, if a plan cannot reach the target recall)

Time and recall depend on task-specific properties of the database:
- the tuple degree distribution
- the document degree distribution

Next, we show how to estimate the degree distributions on the fly.

Page 63

Estimating Cost Model Parameters

Tuple and document degree distributions belong to known distribution families, so the distributions can be characterized with only a few parameters!

Task | Document Distribution | Tuple Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform

[Log-log plots: number of documents vs. document degree fits y = 43060·x^(-3.3863); number of tokens vs. token degree fits y = 5492.2·x^(-2.0254)]

Page 64

Parameter Estimation

Naïve solution for parameter estimation:
- start with a separate "parameter-estimation" phase
- perform random sampling on the database
- stop when cross-validation indicates high confidence

We can do better than this! There is no need for a separate sampling phase, since sampling is equivalent to executing the task:

→ Piggyback parameter estimation onto execution.

Page 65

On-the-fly Parameter Estimation

- Pick the most promising execution plan for the target recall, assuming "default" parameter values
- Start executing the task
- Update the parameter estimates during execution
- Switch plans if the updated statistics indicate so

Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).

[Plot: the initial default estimate is updated during execution and converges toward the correct (but unknown) distribution]

Page 66

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

Page 67

Correctness of Theoretical Analysis

Solid lines: actual time. Dotted lines: predicted time with correct parameters.

Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tuples.

[Chart: execution time (secs, log scale, 100 to 100,000) vs. recall (0.0 to 1.0) for Scan, Filt. Scan, Automatic Query Gen., and Iterative Set Expansion]

Page 68

Experimental Results (Information Extraction)

Solid lines: actual time. Green line: time with the optimizer. (Results are similar in the other experiments; see paper.)

[Chart: execution time (secs, log scale, 100 to 100,000) vs. recall (0.0 to 1.0) for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]

Page 69

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

Page 70

Can we do better?

Yes, for some information extraction systems.

Page 71

Bindings Engine (BE) [Slides: Cafarella 2005]

Bindings Engine (BE) is a search engine where:
- there are no downloads during query processing
- disk seeks are constant in corpus size
- #queries = #phrases

BE's approach:
- a "variabilized" search query language
- pre-processes all documents before query time
- integrates variable/type data with the inverted index, minimizing query seeks

Page 72

BE Query Support

- cities such as <NounPhrase>
- President Bush <Verb>
- <NounPhrase> is the capital of <NounPhrase>
- reach me at <phone-number>

Any sequence of concrete terms and typed variables; NEAR is insufficient. Functions are supported (e.g., "head(<NounPhrase>)").

Page 73

BE Operation

Like a generic search engine, BE:
- downloads a corpus of pages
- creates an index
- uses the index to process queries efficiently

BE further requires:
- a set of indexed types (e.g., "NounPhrase"), with a "recognizer" for each
- string processing functions (e.g., "head()")

A BE system can only process types and functions that its index supports.

Page 74

[Inverted index: each term (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) maps to a posting list of the form: #docs docid0 docid1 … docid#docs-1]

Page 75

Query: such as

Posting list for "such": #docs = 104; docids 21, 150, 322, 2501, …
Posting list for "as": #docs = 15; docids 99, 322, 426, 1309, …

1. Test for equality
2. Advance the smaller pointer
3. Abort when a list is exhausted

Returned docs: 322
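The merge above in code form, a generic sketch of sorted-posting-list intersection rather than BE's actual implementation:

def intersect(a, b):
    # a, b: sorted lists of docids.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):   # 3. abort when a list is exhausted
        if a[i] == b[j]:               # 1. test for equality
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:              # 2. advance the smaller pointer
            i += 1
        else:
            j += 1
    return out

print(intersect([21, 150, 322, 2501], [99, 322, 426, 1309]))  # [322]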

Page 76

"such as" as a phrase query: in phrase queries, match positions as well.

[Postings now store positions: #docs docid0 pos-list0 docid1 pos-list1 …, where each pos-list has the form #posns pos0 pos1 … pos#pos-1]
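The extra positional check as a sketch: after intersecting the docids, a phrase match requires a position of "as" immediately following a position of "such".

def phrase_starts(pos_such, pos_as):
    # Start positions where "such" is directly followed by "as".
    next_positions = set(pos_as)
    return [p for p in pos_such if p + 1 in next_positions]

print(phrase_starts([4, 17], [5, 30]))  # [4]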

Page 77

Neighbor Index

At each position in the index, store “neighbor text” that might be useful

Let’s index <NounPhrase> and <Adj-Term>

“I love cities such as Atlanta.”

Stored neighbors at the position of "cities": Left: AdjT "love".

Page 78

Neighbor Index

At each position in the index, store “neighbor text” that might be useful

Let’s index <NounPhrase> and <Adj-Term>

“I love cities such as Atlanta.”

Stored neighbors at the position of "love": Left: AdjT "I", NP "I". Right: AdjT "cities", NP "cities".

Page 79

Neighbor Index

Query: "cities such as <NounPhrase>"

"I love cities such as Atlanta."

Stored neighbors at the position of "as": Left: AdjT "such". Right: AdjT "Atlanta", NP "Atlanta", so <NounPhrase> binds to "Atlanta".

Page 80

"cities such as <NounPhrase>"

1. Find the phrase query positions, as with phrase queries.
2. If a term is adjacent to a variable, extract the typed value.

[Postings augmented with neighbor data: #posns pos0 neighbor0 pos1 neighbor1 … pos#pos-1; each neighbor block stores #neighbors and a block offset, followed by the typed neighbor strings]

In doc 19, starting at posn 8: "I love cities such as Atlanta." → neighbor0 = AdjT-left: "such"; neighbor1 = NP-right: "Atlanta".
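A toy sketch of the lookup this layout enables; the dictionary layout and slot names are illustrative stand-ins for BE's packed binary format.

# term -> postings of (docid, position, {typed-neighbor-slot: string})
neighbor_index = {
    "as": [(19, 12, {"AdjT-left": "such", "NP-right": "Atlanta"})],
}

def bindings(term, slot):
    # All typed values stored adjacent to occurrences of `term`.
    return [nbrs[slot] for _, _, nbrs in neighbor_index[term] if slot in nbrs]

print(bindings("as", "NP-right"))  # ['Atlanta'], with no document fetch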

Page 81

Current Research Directions

Modeling explicit and implicit network structures:
- modeling the evolution of explicit structure on the web, blogspace, Wikipedia
- modeling implicit link structures in text, collections, the web
- exploiting implicit & explicit social networks (e.g., for epidemiology)

Knowledge discovery from biological and medical data:
- automatic sequence annotation: bioinformatics, genetics
- actionable knowledge extraction from medical articles

Robust information extraction, retrieval, and query processing:
- integrating information in structured and unstructured sources
- robust search/question answering for medical applications
- confidence estimation for extraction from text and other sources
- detecting reliable signals from (noisy) text data (e.g., medical surveillance)
- accuracy (!= authority) of online sources

Information diffusion/propagation in online sources:
- information propagation on the web
- in collaborative sources (Wikipedia, MedLine)

Page 82

Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]

"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."

Page 83

Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]

Page 84

Modeling Social Networks for Epidemiology, security, …

Email exchange mapped onto cubicle locations.

Page 85

Some Research Directions (the outline from Page 81, repeated; this version also lists "Query processing over unstructured text")

Page 86

Mining Text and Sequence Data

Agichtein & Eskin, PSB 2004

ROC50 scores for each class and method

Page 87

Some Research Directions (the outline from Page 81, repeated)

Page 88

Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]

Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.

Page 89

Current Research Directions (the outline from Page 81, repeated; this version notes information propagation on the web and in news)

Page 90

Thank You

Details: http://www.mathcs.emory.edu/~eugene/