Surfacing Information in Large Text Collections
Eugene Agichtein, Microsoft Research


Page 1

Surfacing Information in Large Text Collections

Eugene Agichtein, Microsoft Research

Page 2

2

Example: Angina treatments

PDR

Web search results

Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)

Medical reference and literature

MedLine

guideline for unstable angina

unstable angina management

herbal treatment for angina pain

medications for treating angina

alternative treatment for angina pain

treatment for angina

angina treatments

Page 3

3

Research Goal

Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources

Some approaches:
• Retrieve the relevant documents or passages
• Question answering
• Construct domain-specific “verticals” (MedLine)
• Extract entities and relationships
• Network of relationships: Semantic Web

Page 4

4

Semantic Relationships “Buried” in Unstructured Text

• Web, newsgroups, web logs
• Text databases (PubMed, CiteSeer, etc.)
• Newspaper archives

• Corporate mergers, succession, location
• Terrorist attacks
(Message Understanding Conferences)

…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…

Drug     Condition
statins  recurrent myocardial infarction
statins  strokes
statins  unstable angina pectoris

(the extracted RecommendedTreatment relation)

Page 5

5

What Structured Representation Can Do for You:

… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for Semantic Web

Large Text Collection → Structured Relation
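To make the first two points concrete, here is a minimal sketch (not from the talk; the table and column names are made up) of loading extracted tuples into an ordinary RDBMS and querying them precisely, returning answers rather than documents:

    import sqlite3

    # Hypothetical tuples extracted from text: (drug, condition) pairs.
    extracted = [
        ("statins", "recurrent myocardial infarction"),
        ("statins", "strokes"),
        ("statins", "unstable angina pectoris"),
    ]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recommended_treatment (drug TEXT, condition TEXT)")
    conn.executemany("INSERT INTO recommended_treatment VALUES (?, ?)", extracted)

    # Precise querying: the answer is a set of tuples, not a list of documents.
    rows = conn.execute(
        "SELECT drug FROM recommended_treatment WHERE condition LIKE '%angina%'"
    ).fetchall()
    print(rows)  # [('statins',)]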

Page 6

6

Challenges in Information Extraction

Portability
• Reduce effort to tune for new domains and tasks

• MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access
• Enable information extraction over large collections

• 1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data (“Bootstrapping”)
• Snowball: Partially Supervised Information Extraction

• Querying Large Text Databases for Efficient Information Extraction

Page 7

7

The Snowball System: Overview

Snowball

Text Database

Organization         Location        Conf
Microsoft            Redmond         1
IBM                  Armonk          1
Intel                Santa Clara     1
AG Edwards           St Louis        0.9
Air Canada           Montreal        0.8
7th Level            Richardson      0.8
3Com Corp            Santa Clara     0.8
3DO                  Redwood City    0.7
3M                   Minneapolis     0.7
MacWorld             San Francisco   0.7
...                  ...             ...
157th Street         Manhattan       0.52
15th Party Congress  China           0.3
15th Century Europe  Dark Ages       0.1


Page 8

8

Snowball: Getting User Input

User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.

Processing loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → repeat with new examples

ACM DL 2000

Organization Headquarters

Microsoft Redmond

IBM Armonk

Intel Santa Clara
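Below is a minimal sketch of one such bootstrapping iteration (my own simplification over a toy in-memory corpus, not the actual Snowball code; Snowball's real patterns are weighted term vectors anchored on named-entity tags):

    import re

    corpus = [
        "Microsoft is based in Redmond and employs thousands.",
        "IBM is based in Armonk according to the filing.",
        "Google is based in Mountain View, California.",
    ]
    seeds = {("Microsoft", "Redmond"), ("IBM", "Armonk")}

    # 1. Find occurrences of the seed tuples and keep the text between the
    #    two entities as a (very crude) extraction pattern.
    patterns = set()
    for org, loc in seeds:
        for doc in corpus:
            m = re.search(re.escape(org) + r"(.{1,40}?)" + re.escape(loc), doc)
            if m:
                patterns.add(m.group(1))  # e.g. " is based in "

    # 2. Apply the learned patterns to extract new candidate tuples.
    candidates = set()
    for pat in patterns:
        pair = r"([A-Z]\w*)" + re.escape(pat) + r"([A-Z]\w+(?: [A-Z]\w+)*)"
        for doc in corpus:
            for m in re.finditer(pair, doc):
                candidates.add((m.group(1), m.group(2)))

    print(candidates - seeds)  # {('Google', 'Mountain View')}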

Page 9

9

Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm
• “Hide” labels for some seed tuples

• Iterate EM algorithm to convergence on tuple/pattern confidence values

• Set threshold t such that (t > 90% of spy tuples)

• Re-initialize Snowball using new seed tuples

Organization         Headquarters    Initial  Final
Microsoft            Redmond         1        1
IBM                  Armonk          1        0.8
Intel                Santa Clara     1        0.9
AG Edwards           St Louis        0        0.9
Air Canada           Montreal        0        0.8
7th Level            Richardson      0        0.8
3Com Corp            Santa Clara     0        0.8
3DO                  Redwood City    0        0.7
3M                   Minneapolis     0        0.7
MacWorld             San Francisco   0        0.7
...                  ...             ...      ...
157th Street         Manhattan       0        0.52
15th Party Congress  China           0        0.3
15th Century Europe  Dark Ages       0        0.1
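A sketch of the “spy” step (my paraphrase, not the published EM-Spy implementation): hide a fraction of the seed tuples, score all candidates using only the remaining seeds, and place the acceptance threshold relative to the scores the hidden seeds received. The scoring callback here is a hypothetical stand-in for the EM iterations.

    import random

    def choose_threshold(seed_tuples, evaluate_confidences,
                         spy_fraction=0.2, pct=0.9):
        """Hide some seeds as "spies", score all candidates with the visible
        seeds only, and set the threshold at the pct-quantile of the spy
        confidences (following the slide's "t > 90% of spy tuples" rule)."""
        seeds = list(seed_tuples)
        random.shuffle(seeds)
        n_spies = max(1, int(spy_fraction * len(seeds)))
        spies, visible = seeds[:n_spies], seeds[n_spies:]

        # evaluate_confidences is a hypothetical callback: it runs the
        # pattern/tuple scoring (e.g. the EM loop) seeded only with `visible`
        # and returns {tuple: confidence} for every candidate tuple.
        conf = evaluate_confidences(visible)

        spy_scores = sorted(conf.get(t, 0.0) for t in spies)
        return spy_scores[int(pct * (len(spy_scores) - 1))]

    # Toy usage with a fixed scorer standing in for the real EM evaluation.
    demo_conf = {("Microsoft", "Redmond"): 1.0, ("IBM", "Armonk"): 0.8,
                 ("Intel", "Santa Clara"): 0.9, ("3M", "Minneapolis"): 0.7}
    print(choose_threshold(demo_conf, lambda visible: demo_conf))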

Page 10

10

Adapting Snowball for New Relations

Large parameter space
• Initial seed tuples (randomly chosen, multiple runs)

• Acceptor features: words, stems, n-grams, phrases, punctuation, POS

• Feature selection techniques: OR, NB, Freq, “support”, combinations

• Feature weights: TF*IDF, TF, TF*NB, NB

• Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

Automatically estimate parameter values:
• Estimate operating parameters based on occurrences of seed tuples

• Run cross-validation on hold-out sets of seed tuples for optimal perf.

• Seed occurrences that do not have close “neighbors” are discarded
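A sketch of the cross-validation idea from this slide (the helper run_extraction(seeds, params) is hypothetical, standing in for a full Snowball run and assumed to return the set of extracted tuples):

    import random
    from itertools import product

    def tune_parameters(seed_tuples, run_extraction, param_grid, n_folds=3):
        """Pick the parameter setting that best recovers held-out seed tuples."""
        seeds = list(seed_tuples)
        random.shuffle(seeds)
        folds = [seeds[i::n_folds] for i in range(n_folds)]

        def recall_for(params):
            hits = total = 0
            for i, held_out in enumerate(folds):
                train = [t for j, f in enumerate(folds) if j != i for t in f]
                extracted = run_extraction(train, params)
                hits += sum(1 for t in held_out if t in extracted)
                total += len(held_out)
            return hits / total if total else 0.0

        configs = [dict(zip(param_grid, values))
                   for values in product(*param_grid.values())]
        return max(configs, key=recall_for)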

Page 11

11

Example Task 1: DiseaseOutbreaks

Proteus: 0.409, Snowball: 0.415

SDM 2006

Page 12

12

Example Task 2: Bioinformatics

100,000+ gene and protein synonyms extracted from 50,000+ journal articles

Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)

ISMB 2003

“APO-1, also known as DR6…”
“MEK4, also called SEK1…”

Page 13

13

Snowball Used in Various Domains

News: NYT, WSJ, AP [DL’00, SDM’06]

• CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDRHealth, Micromedex… [Ph.D. Thesis]

• AdverseEffects, DrugInteractions, RecommendedTreatments

Biological literature: GeneWays corpus [ISMB’03]

• Gene and Protein Synonyms

Page 14

14

Limits of Bootstrapping for Extraction

Task is “easy” when context term distributions diverge from the background

Quantify as relative entropy (Kullback-Leibler divergence)

After calibration, the metric predicts whether bootstrapping is likely to work

[Plot: term frequency (0 to 0.07) of common collection words: the, to, and, said, ’s, company, mrs, won, president]

$\mathrm{KL}(LM_C \,\|\, LM_{BG}) = \sum_{w \in V_C} LM_C(w) \log \frac{LM_C(w)}{LM_{BG}(w)}$

CIKM 2005

President George W Bush’s three-day visit to India
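As an illustration (toy word counts, not the paper's data), a minimal computation of the relative-entropy metric above, comparing the term distribution of seed contexts against the background collection:

    import math
    from collections import Counter

    def kl_divergence(context_counts, background_counts):
        """KL(LM_C || LM_BG) over the context vocabulary, with add-one
        smoothing of the background model so the log is always defined."""
        vocab = set(background_counts) | set(context_counts)
        c_total = sum(context_counts.values())
        bg_total = sum(background_counts.values()) + len(vocab)
        kl = 0.0
        for w, c in context_counts.items():
            p_c = c / c_total
            p_bg = (background_counts.get(w, 0) + 1) / bg_total
            kl += p_c * math.log(p_c / p_bg)
        return kl

    # Toy example: words around seed mentions vs. the collection as a whole.
    context = Counter("outbreak of ebola reported in zaire outbreak reported".split())
    background = Counter("the to and said company president reported the of in".split())
    print(kl_divergence(context, background))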

Page 15

15

Extracting All Relation Instances From a Text Database

Brute force approach: feed all docs to information extraction system

• Only a tiny fraction of documents are often useful
• Many databases are not crawlable
• Often a search interface is available, with an existing keyword index

How to identify “useful” documents?

Text Database → Information Extraction System → Structured Relation
(expensive for large collections)

Page 16

16

Accessing Text DBs via Search Engines

Text Database → Search Engine → Information Extraction System → Structured Relation

Search engines impose limitations
• Limit on documents retrieved per query

• Support simple keywords and phrases

• Ignore “stopwords” (e.g., “a”, “is”)

Page 17

17

Text-Centric Task I: Information Extraction

Information extraction applications extract structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire

Information Extraction System

(e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan

Page 18

18

Executing a Text-Centric Task

1. Retrieve documents from the text database
2. Process documents with the extraction system
3. Extract output tokens

Similar to relational world

Two major execution paradigms:
• Scan-based: retrieve and process documents sequentially
• Index-based: query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world

• Indexes are only “approximate”: index is on keywords, not on tokens of interest
• Choice of execution plan affects output completeness (not only speed)

→ underlying data distribution dictates what is best
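A compact sketch of the two paradigms (the extract() and search() helpers are hypothetical, not tied to any particular system):

    def scan_plan(documents, extract):
        """Scan-based: process every document; complete but expensive."""
        tuples = set()
        for doc in documents:
            tuples |= extract(doc)
        return tuples

    def index_plan(search, queries, extract, max_results=100):
        """Index-based: process only documents matching the queries; cheaper,
        but output completeness depends on the queries and the result limit."""
        tuples, seen = set(), set()
        for q in queries:
            for doc in search(q, max_results):
                if doc not in seen:
                    seen.add(doc)
                    tuples |= extract(doc)
        return tuples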

Page 19

19

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

User-Provided Seed Tuples → Query Generation → Queries → Search Engine → Promising Documents → Information Extraction System → Extracted Relation

User-provided seed tuples:
DiseaseName  Location  Date
Malaria      Ethiopia  Jan. 1995
Ebola        Zaire     May 1995

Extracted relation:
DiseaseName      Location   Date
Malaria          Ethiopia   Jan. 1995
Ebola            Zaire      May 1995
Mad Cow Disease  The U.K.   July 1995
Pneumonia        The U.S.   Feb. 1995

Problem: Learn keyword queries to retrieve “promising” documents

Page 20

20

Learning Queries to Retrieve Promising Documents

1. Get document sample with “likely negative” and “likely positive” examples.

2. Label sample documents using information extraction system as “oracle.”

3. Train classifiers to “recognize” useful documents.

4. Generate queries from classifier model/rules.

[Diagram: user-provided seed tuples → seed sampling over the text database via the search engine → sample documents labeled +/- by the information extraction system → classifier training → query generation]
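A rough sketch of steps 2 to 4 (my simplification of the idea, not the actual QXtract code): label sampled documents by running the extractor on them, score words by how strongly they are associated with the useful documents, and turn the top-scoring words into keyword queries. The log-odds ranking below is just one simple stand-in for the slide's classifier.

    import math
    from collections import Counter

    def learn_queries(sample_docs, extract, num_queries=3):
        """sample_docs: document strings; extract(doc) returns the set of tuples
        found in doc, so the extraction system acts as the labeling oracle."""
        pos, neg = Counter(), Counter()
        n_pos = n_neg = 0
        for doc in sample_docs:
            words = set(doc.lower().split())
            if extract(doc):      # "useful" document
                pos.update(words)
                n_pos += 1
            else:                 # "useless" document
                neg.update(words)
                n_neg += 1

        def score(w):
            # Smoothed log-odds of the word appearing in useful vs. useless docs.
            p = (pos[w] + 1) / (n_pos + 2)
            q = (neg[w] + 1) / (n_neg + 2)
            return math.log(p / q)

        vocab = set(pos) | set(neg)
        ranked = sorted(vocab, key=score, reverse=True)
        return ranked[:num_queries]   # each top word becomes a keyword query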

Page 21

21

SIGMOD 2003 Demonstration

Page 22

22

Querying Graph

The querying graph is a bipartite graph, containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

[Diagram: bipartite querying graph with tokens t1-t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) connected to documents d1-d5]

Page 23

23

Sizes of Connected Components

[Diagram: reachability graph structure with In, Core (strongly connected), and Out components, starting from token t0]

How many tuples are in largest Core + Out?

Conjecture:
• Degree distribution in reachability graphs follows “power-law.”

• Then, reachability graph has at most one giant component.

Define reachability as the fraction of tuples in the largest Core + Out
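A sketch of this computation on a toy querying graph (the graph below is invented; networkx is used for the strongly connected components): edges run from a token to each document its query retrieves and from a document to each token it contains; reachability is the fraction of tokens in the largest Core plus everything reachable from it.

    import networkx as nx

    # Toy querying graph: token -> retrieved docs, doc -> contained tokens.
    retrieves = {"t1": ["d1"], "t2": ["d1", "d2"], "t3": ["d3"],
                 "t4": ["d4"], "t5": []}
    contains = {"d1": ["t1", "t2"], "d2": ["t3"], "d3": ["t2"], "d4": ["t5"]}

    G = nx.DiGraph()
    for t, docs in retrieves.items():
        G.add_node(t)
        G.add_edges_from((t, d) for d in docs)
    for d, toks in contains.items():
        G.add_edges_from((d, t) for t in toks)

    tokens = set(retrieves)
    # Core: largest strongly connected component; Out: nodes reachable from it.
    core = max(nx.strongly_connected_components(G), key=len)
    out = set().union(*(nx.descendants(G, n) for n in core))
    reachable = (core | out) & tokens
    print(len(reachable) / len(tokens))  # fraction of reachable tokens: 0.6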

Page 24

24

NYT Reachability Graph: Outdegree Distribution

MaxResults=10

MaxResults=50

Matches the power-law distribution

Page 25

25

NYT: Component Size Distribution

MaxResults=10: CG / |T| = 0.297 (not “reachable”)
MaxResults=50: CG / |T| = 0.620 (“reachable”)

Page 26

26

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

Page 27

27

Estimate Cost of Retrieval Methods

Alternatives:
• Scan, Filtered Scan, Tuples, QXtract

General cost model for text-centric tasks
• Information extraction, summary construction, etc…

Estimate the expected cost of each access method
• Parametric model describing all retrieval steps
• Extended analysis to arbitrary degree distributions
• Parameter estimates can be “piggybacked” at runtime

Cost estimates can be provided to a query optimizer for nearly optimal execution

SIGMOD 2006
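A hedged illustration (not the SIGMOD 2006 model itself; every parameter below is a hypothetical stand-in for a quantity estimated at runtime) of how per-plan cost estimates can drive the choice of execution strategy:

    def estimate_costs(num_docs, useful_fraction, target_recall, time_per_doc,
                       classifier_cost=0.01, classifier_selectivity=0.2,
                       query_cost=0.1, docs_per_query=100, query_precision=0.3):
        """Very rough per-plan cost estimates (in seconds)."""
        useful_needed = target_recall * useful_fraction * num_docs

        # Scan: process documents in arbitrary order until enough useful ones.
        scan = (useful_needed / useful_fraction) * time_per_doc

        # Filtered Scan: a cheap classifier discards most useless documents.
        filtered = (useful_needed / useful_fraction) * (
            classifier_cost + classifier_selectivity * time_per_doc)

        # Query-based (e.g. QXtract): process only the retrieved documents.
        queries_needed = useful_needed / (docs_per_query * query_precision)
        querying = queries_needed * (query_cost + docs_per_query * time_per_doc)

        return {"Scan": scan, "Filtered Scan": filtered, "QXtract": querying}

    costs = estimate_costs(num_docs=1_000_000, useful_fraction=0.01,
                           target_recall=0.4, time_per_doc=1.0)
    print(min(costs, key=costs.get), costs)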

Page 28

28

Optimized Execution of Text-Centric Tasks

[Plot comparing the execution cost of the Tuples, Filtered Scan, and Scan strategies]

Page 29

29

Current Research Agenda

Seamless, intuitive, and robust access to knowledge in biological and medical sources

Some research problems:
• Robust query processing over unstructured data

• Intelligently interpreting user information needs

• Text mining for bio- and medical informatics

• Model implicit network structures:
• Entity graphs in Wikipedia

• Protein-Protein interaction networks

• Semantic maps of MedLine

Page 30

30

Deriving Actionable Knowledge from Unstructured (text) Data

Extract actionable rules from medical text (Medline, patient reports, …)
• Joint project (early stages) with medical school, GT

Epidemiology surveillance (w/ SPH)

Query processing over unstructured data
• Tune extraction for query workload

• Index structures to support effective extraction

• Queries over extracted and “native” tables

Page 31

31

Text Mining for Bioinformatics

Impossible to keep up with literature, experimental notes

Automatically update ontologies, indexes

Automate tedious work of post-wetlab search

Identify (and assign text labels to) DNA structures

Page 32

32

Mining Text and Sequence Data

PSB 2004

ROC50 scores for each class and method