Surfacing Information in Large Text Collections
Eugene Agichtein, Microsoft Research


Page 1

Surfacing Information in Large Text Collections

Eugene Agichtein, Microsoft Research

Page 2

2

Example: Angina treatments

PDR

Web search results

Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)

Medical reference and literature

MedLine

guideline for unstable angina

unstable angina management

herbal treatment for angina pain

medications for treating angina

alternative treatment for angina pain

treatment for angina

angina treatments

Page 3

3

Research Goal

Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources

Some approaches:
• Retrieve the relevant documents or passages
• Question answering
• Construct domain-specific “verticals” (MedLine)
• Extract entities and relationships
• Network of relationships: Semantic Web

Page 4

4

Semantic Relationships “Buried” in Unstructured Text

• Web, newsgroups, web logs
• Text databases (PubMed, CiteSeer, etc.)
• Newspaper archives

• Corporate mergers, succession, location
• Terrorist attacks
(Message Understanding Conferences)

…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…

Drug     Condition
statins  recurrent myocardial infarction
statins  strokes
statins  unstable angina pectoris

(the extracted RecommendedTreatment relation)

Page 5

5

What Structured Representation Can Do for You:

… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for Semantic Web

Large Text Collection → Structured Relation
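To make the first two points concrete, here is a minimal sketch (not from the talk; the table and column names are made up) of loading extracted tuples into an ordinary RDBMS and querying them precisely, returning answers rather than documents:

    import sqlite3

    # Hypothetical tuples extracted from text: (drug, condition) pairs.
    extracted = [
        ("statins", "recurrent myocardial infarction"),
        ("statins", "strokes"),
        ("statins", "unstable angina pectoris"),
    ]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recommended_treatment (drug TEXT, condition TEXT)")
    conn.executemany("INSERT INTO recommended_treatment VALUES (?, ?)", extracted)

    # Precise querying: the answer is a set of tuples, not a list of documents.
    rows = conn.execute(
        "SELECT drug FROM recommended_treatment WHERE condition LIKE '%angina%'"
    ).fetchall()
    print(rows)  # [('statins',)]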

Page 6

6

Challenges in Information Extraction

Portability
• Reduce effort to tune for new domains and tasks

• MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access
• Enable information extraction over large collections

• 1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data (“Bootstrapping”)
• Snowball: Partially Supervised Information Extraction

• Querying Large Text Databases for Efficient Information Extraction

Page 7

7

The Snowball System: Overview

Snowball

Text Database

Organization         Location        Conf
Microsoft            Redmond         1
IBM                  Armonk          1
Intel                Santa Clara     1
AG Edwards           St Louis        0.9
Air Canada           Montreal        0.8
7th Level            Richardson      0.8
3Com Corp            Santa Clara     0.8
3DO                  Redwood City    0.7
3M                   Minneapolis     0.7
MacWorld             San Francisco   0.7
...                  ...             ...
157th Street         Manhattan       0.52
15th Party Congress  China           0.3
15th Century Europe  Dark Ages       0.1


Page 8

8

Snowball: Getting User Input

User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.

Processing loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → repeat with new examples

ACM DL 2000

Organization Headquarters

Microsoft Redmond

IBM Armonk

Intel Santa Clara
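Below is a minimal sketch of one such bootstrapping iteration (my own simplification over a toy in-memory corpus, not the actual Snowball code; Snowball's real patterns are weighted term vectors anchored on named-entity tags):

    import re

    corpus = [
        "Microsoft is based in Redmond and employs thousands.",
        "IBM is based in Armonk according to the filing.",
        "Google is based in Mountain View, California.",
    ]
    seeds = {("Microsoft", "Redmond"), ("IBM", "Armonk")}

    # 1. Find occurrences of the seed tuples and keep the text between the
    #    two entities as a (very crude) extraction pattern.
    patterns = set()
    for org, loc in seeds:
        for doc in corpus:
            m = re.search(re.escape(org) + r"(.{1,40}?)" + re.escape(loc), doc)
            if m:
                patterns.add(m.group(1))  # e.g. " is based in "

    # 2. Apply the learned patterns to extract new candidate tuples.
    candidates = set()
    for pat in patterns:
        pair = r"([A-Z]\w*)" + re.escape(pat) + r"([A-Z]\w+(?: [A-Z]\w+)*)"
        for doc in corpus:
            for m in re.finditer(pair, doc):
                candidates.add((m.group(1), m.group(2)))

    print(candidates - seeds)  # {('Google', 'Mountain View')}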

Page 9

9

Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm
• “Hide” labels for some seed tuples

• Iterate EM algorithm to convergence on tuple/pattern confidence values

• Set threshold t such that (t > 90% of spy tuples)

• Re-initialize Snowball using new seed tuples

Organization         Headquarters    Initial  Final
Microsoft            Redmond         1        1
IBM                  Armonk          1        0.8
Intel                Santa Clara     1        0.9
AG Edwards           St Louis        0        0.9
Air Canada           Montreal        0        0.8
7th Level            Richardson      0        0.8
3Com Corp            Santa Clara     0        0.8
3DO                  Redwood City    0        0.7
3M                   Minneapolis     0        0.7
MacWorld             San Francisco   0        0.7
...                  ...             ...      ...
157th Street         Manhattan       0        0.52
15th Party Congress  China           0        0.3
15th Century Europe  Dark Ages       0        0.1
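A sketch of the “spy” step (my paraphrase, not the published EM-Spy implementation): hide a fraction of the seed tuples, score all candidates using only the remaining seeds, and place the acceptance threshold relative to the scores the hidden seeds received. The scoring callback here is a hypothetical stand-in for the EM iterations.

    import random

    def choose_threshold(seed_tuples, evaluate_confidences,
                         spy_fraction=0.2, pct=0.9):
        """Hide some seeds as "spies", score all candidates with the visible
        seeds only, and set the threshold at the pct-quantile of the spy
        confidences (following the slide's "t > 90% of spy tuples" rule)."""
        seeds = list(seed_tuples)
        random.shuffle(seeds)
        n_spies = max(1, int(spy_fraction * len(seeds)))
        spies, visible = seeds[:n_spies], seeds[n_spies:]

        # evaluate_confidences is a hypothetical callback: it runs the
        # pattern/tuple scoring (e.g. the EM loop) seeded only with `visible`
        # and returns {tuple: confidence} for every candidate tuple.
        conf = evaluate_confidences(visible)

        spy_scores = sorted(conf.get(t, 0.0) for t in spies)
        return spy_scores[int(pct * (len(spy_scores) - 1))]

    # Toy usage with a fixed scorer standing in for the real EM evaluation.
    demo_conf = {("Microsoft", "Redmond"): 1.0, ("IBM", "Armonk"): 0.8,
                 ("Intel", "Santa Clara"): 0.9, ("3M", "Minneapolis"): 0.7}
    print(choose_threshold(demo_conf, lambda visible: demo_conf))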

Page 10

10

Adapting Snowball for New Relations

Large parameter space
• Initial seed tuples (randomly chosen, multiple runs)

• Acceptor features: words, stems, n-grams, phrases, punctuation, POS

• Feature selection techniques: OR, NB, Freq, “support”, combinations

• Feature weights: TF*IDF, TF, TF*NB, NB

• Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

Automatically estimate parameter values:
• Estimate operating parameters based on occurrences of seed tuples

• Run cross-validation on hold-out sets of seed tuples for optimal perf.

• Seed occurrences that do not have close “neighbors” are discarded
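A sketch of the cross-validation idea from this slide (the helper run_extraction(seeds, params) is hypothetical, standing in for a full Snowball run and assumed to return the set of extracted tuples):

    import random
    from itertools import product

    def tune_parameters(seed_tuples, run_extraction, param_grid, n_folds=3):
        """Pick the parameter setting that best recovers held-out seed tuples."""
        seeds = list(seed_tuples)
        random.shuffle(seeds)
        folds = [seeds[i::n_folds] for i in range(n_folds)]

        def recall_for(params):
            hits = total = 0
            for i, held_out in enumerate(folds):
                train = [t for j, f in enumerate(folds) if j != i for t in f]
                extracted = run_extraction(train, params)
                hits += sum(1 for t in held_out if t in extracted)
                total += len(held_out)
            return hits / total if total else 0.0

        configs = [dict(zip(param_grid, values))
                   for values in product(*param_grid.values())]
        return max(configs, key=recall_for)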

Page 11

11

Example Task 1: DiseaseOutbreaks

Proteus: 0.409, Snowball: 0.415

SDM 2006

Page 12

12

Example Task 2: Bioinformatics

100,000+ gene and protein synonyms extracted from 50,000+ journal articles

Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)

ISMB 2003

“APO-1, also known as DR6…”
“MEK4, also called SEK1…”

Page 13

13

Snowball Used in Various Domains

News: NYT, WSJ, AP [DL’00, SDM’06]

• CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDRHealth, Micromedex… [Ph.D. Thesis]

• AdverseEffects, DrugInteractions, RecommendedTreatments

Biological literature: GeneWays corpus [ISMB’03]

• Gene and Protein Synonyms

Page 14

14

Limits of Bootstrapping for Extraction

Task is “easy” when context term distributions diverge from the background

Quantify as relative entropy (Kullback-Leibler divergence)

After calibration, the metric predicts whether bootstrapping is likely to work

[Plot: term frequency (0 to 0.07) of common collection words: the, to, and, said, ’s, company, mrs, won, president]

$\mathrm{KL}(LM_C \,\|\, LM_{BG}) = \sum_{w \in V_C} LM_C(w) \log \frac{LM_C(w)}{LM_{BG}(w)}$

CIKM 2005

President George W Bush’s three-day visit to India
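As an illustration (toy word counts, not the paper's data), a minimal computation of the relative-entropy metric above, comparing the term distribution of seed contexts against the background collection:

    import math
    from collections import Counter

    def kl_divergence(context_counts, background_counts):
        """KL(LM_C || LM_BG) over the context vocabulary, with add-one
        smoothing of the background model so the log is always defined."""
        vocab = set(background_counts) | set(context_counts)
        c_total = sum(context_counts.values())
        bg_total = sum(background_counts.values()) + len(vocab)
        kl = 0.0
        for w, c in context_counts.items():
            p_c = c / c_total
            p_bg = (background_counts.get(w, 0) + 1) / bg_total
            kl += p_c * math.log(p_c / p_bg)
        return kl

    # Toy example: words around seed mentions vs. the collection as a whole.
    context = Counter("outbreak of ebola reported in zaire outbreak reported".split())
    background = Counter("the to and said company president reported the of in".split())
    print(kl_divergence(context, background))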

Page 15

15

Extracting All Relation Instances From a Text Database

Brute force approach: feed all docs to information extraction system

• Only a tiny fraction of documents are often useful
• Many databases are not crawlable
• Often a search interface is available, with an existing keyword index

How to identify “useful” documents?

Text Database → Information Extraction System → Structured Relation
(expensive for large collections)

Page 16

16

Accessing Text DBs via Search Engines

Text Database → Search Engine → Information Extraction System → Structured Relation

Search engines impose limitations
• Limit on documents retrieved per query

• Support simple keywords and phrases

• Ignore “stopwords” (e.g., “a”, “is”)

Page 17

17

Text-Centric Task I: Information Extraction

Information extraction applications extract structured relations from unstructured text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire

Information Extraction System

(e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan

Page 18

18

Executing a Text-Centric Task

1. Retrieve documents from the text database
2. Process documents with the extraction system
3. Extract output tokens

Similar to relational world

Two major execution paradigms:
• Scan-based: retrieve and process documents sequentially
• Index-based: query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world

• Indexes are only “approximate”: index is on keywords, not on tokens of interest
• Choice of execution plan affects output completeness (not only speed)

→ underlying data distribution dictates what is best
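A compact sketch of the two paradigms (the extract() and search() helpers are hypothetical, not tied to any particular system):

    def scan_plan(documents, extract):
        """Scan-based: process every document; complete but expensive."""
        tuples = set()
        for doc in documents:
            tuples |= extract(doc)
        return tuples

    def index_plan(search, queries, extract, max_results=100):
        """Index-based: process only documents matching the queries; cheaper,
        but output completeness depends on the queries and the result limit."""
        tuples, seen = set(), set()
        for q in queries:
            for doc in search(q, max_results):
                if doc not in seen:
                    seen.add(doc)
                    tuples |= extract(doc)
        return tuples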

Page 19

19

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

User-Provided Seed Tuples → Query Generation → Queries → Search Engine → Promising Documents → Information Extraction System → Extracted Relation

User-provided seed tuples:
DiseaseName  Location  Date
Malaria      Ethiopia  Jan. 1995
Ebola        Zaire     May 1995

Extracted relation:
DiseaseName      Location   Date
Malaria          Ethiopia   Jan. 1995
Ebola            Zaire      May 1995
Mad Cow Disease  The U.K.   July 1995
Pneumonia        The U.S.   Feb. 1995

Problem: Learn keyword queries to retrieve “promising” documents

Page 20

20

Learning Queries to Retrieve Promising Documents

1. Get document sample with “likely negative” and “likely positive” examples.

2. Label sample documents using information extraction system as “oracle.”

3. Train classifiers to “recognize” useful documents.

4. Generate queries from classifier model/rules.

[Diagram: user-provided seed tuples → seed sampling over the text database via the search engine → sample documents labeled +/- by the information extraction system → classifier training → query generation]
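A rough sketch of steps 2 to 4 (my simplification of the idea, not the actual QXtract code): label sampled documents by running the extractor on them, score words by how strongly they are associated with the useful documents, and turn the top-scoring words into keyword queries. The log-odds ranking below is just one simple stand-in for the slide's classifier.

    import math
    from collections import Counter

    def learn_queries(sample_docs, extract, num_queries=3):
        """sample_docs: document strings; extract(doc) returns the set of tuples
        found in doc, so the extraction system acts as the labeling oracle."""
        pos, neg = Counter(), Counter()
        n_pos = n_neg = 0
        for doc in sample_docs:
            words = set(doc.lower().split())
            if extract(doc):      # "useful" document
                pos.update(words)
                n_pos += 1
            else:                 # "useless" document
                neg.update(words)
                n_neg += 1

        def score(w):
            # Smoothed log-odds of the word appearing in useful vs. useless docs.
            p = (pos[w] + 1) / (n_pos + 2)
            q = (neg[w] + 1) / (n_neg + 2)
            return math.log(p / q)

        vocab = set(pos) | set(neg)
        ranked = sorted(vocab, key=score, reverse=True)
        return ranked[:num_queries]   # each top word becomes a keyword query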

Page 21

21

SIGMOD 2003 Demonstration

Page 22

22

Querying Graph

The querying graph is a bipartite graph, containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

[Diagram: bipartite querying graph with tokens t1-t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) connected to documents d1-d5]

Page 23

23

Sizes of Connected Components

[Diagram: reachability graph structure with In, Core (strongly connected), and Out components, starting from token t0]

How many tuples are in largest Core + Out?

Conjecture:
• Degree distribution in reachability graphs follows “power-law.”

• Then, reachability graph has at most one giant component.

Define reachability as the fraction of tuples in the largest Core + Out
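A sketch of this computation on a toy querying graph (the graph below is invented; networkx is used for the strongly connected components): edges run from a token to each document its query retrieves and from a document to each token it contains; reachability is the fraction of tokens in the largest Core plus everything reachable from it.

    import networkx as nx

    # Toy querying graph: token -> retrieved docs, doc -> contained tokens.
    retrieves = {"t1": ["d1"], "t2": ["d1", "d2"], "t3": ["d3"],
                 "t4": ["d4"], "t5": []}
    contains = {"d1": ["t1", "t2"], "d2": ["t3"], "d3": ["t2"], "d4": ["t5"]}

    G = nx.DiGraph()
    for t, docs in retrieves.items():
        G.add_node(t)
        G.add_edges_from((t, d) for d in docs)
    for d, toks in contains.items():
        G.add_edges_from((d, t) for t in toks)

    tokens = set(retrieves)
    # Core: largest strongly connected component; Out: nodes reachable from it.
    core = max(nx.strongly_connected_components(G), key=len)
    out = set().union(*(nx.descendants(G, n) for n in core))
    reachable = (core | out) & tokens
    print(len(reachable) / len(tokens))  # fraction of reachable tokens: 0.6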

Page 24

24

NYT Reachability Graph: Outdegree Distribution

MaxResults=10

MaxResults=50

Matches the power-law distribution

Page 25

25

NYT: Component Size Distribution

MaxResults=10: CG / |T| = 0.297 (not “reachable”)
MaxResults=50: CG / |T| = 0.620 (“reachable”)

Page 26

26

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

Page 27

27

Estimate Cost of Retrieval Methods

Alternatives:
• Scan, Filtered Scan, Tuples, QXtract

General cost model for text-centric tasks
• Information extraction, summary construction, etc…

Estimate the expected cost of each access method
• Parametric model describing all retrieval steps
• Extended analysis to arbitrary degree distributions
• Parameter estimates can be “piggybacked” at runtime

Cost estimates can be provided to a query optimizer for nearly optimal execution

SIGMOD 2006
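A hedged illustration (not the SIGMOD 2006 model itself; every parameter below is a hypothetical stand-in for a quantity estimated at runtime) of how per-plan cost estimates can drive the choice of execution strategy:

    def estimate_costs(num_docs, useful_fraction, target_recall, time_per_doc,
                       classifier_cost=0.01, classifier_selectivity=0.2,
                       query_cost=0.1, docs_per_query=100, query_precision=0.3):
        """Very rough per-plan cost estimates (in seconds)."""
        useful_needed = target_recall * useful_fraction * num_docs

        # Scan: process documents in arbitrary order until enough useful ones.
        scan = (useful_needed / useful_fraction) * time_per_doc

        # Filtered Scan: a cheap classifier discards most useless documents.
        filtered = (useful_needed / useful_fraction) * (
            classifier_cost + classifier_selectivity * time_per_doc)

        # Query-based (e.g. QXtract): process only the retrieved documents.
        queries_needed = useful_needed / (docs_per_query * query_precision)
        querying = queries_needed * (query_cost + docs_per_query * time_per_doc)

        return {"Scan": scan, "Filtered Scan": filtered, "QXtract": querying}

    costs = estimate_costs(num_docs=1_000_000, useful_fraction=0.01,
                           target_recall=0.4, time_per_doc=1.0)
    print(min(costs, key=costs.get), costs)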

Page 28

28

Optimized Execution of Text-Centric Tasks

[Plot comparing the execution cost of the Tuples, Filtered Scan, and Scan strategies]

Page 29

29

Current Research Agenda

Seamless, intuitive, and robust access to knowledge in biological and medical sources

Some research problems:
• Robust query processing over unstructured data

• Intelligently interpreting user information needs

• Text mining for bio- and medical informatics

• Model implicit network structures:
• Entity graphs in Wikipedia

• Protein-Protein interaction networks

• Semantic maps of MedLine

Page 30

30

Deriving Actionable Knowledge from Unstructured (text) Data

Extract actionable rules from medical text (Medline, patient reports, …)
• Joint project (early stages) with medical school, GT

Epidemiology surveillance (w/ SPH)

Query processing over unstructured data
• Tune extraction for query workload

• Index structures to support effective extraction

• Queries over extracted and “native” tables

Page 31

31

Text Mining for Bioinformatics

Impossible to keep up with literature, experimental notes

Automatically update ontologies, indexes

Automate tedious work of post-wetlab search

Identify (and assign text labels to) DNA structures

Page 32

32

Mining Text and Sequence Data

PSB 2004

ROC50 scores for each class and method