115
Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang

Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Embed Size (px)

Citation preview

Page 1: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Toward Unified Graphical Models ofInformation Extraction and Data Mining

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with

Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh,

Wei Li, Andres Corrada, Xuerui Wang

Page 2: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Goal:

Mine actionable knowledgefrom unstructured text.

Page 3: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Page 4: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE fromChinese Documents regarding Weather

Department of Terrestrial System, Chinese Academy of Sciences

200k+ documentsseveral millennia old

- Qing Dynasty Archives- memos- newspaper articles- diaries

Page 5: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE from Research Papers[McCallum et al ‘99]

Page 6: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE from Research Papers

Page 7: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Mining Research Papers

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

[Giles et al]

[Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

Page 8: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

What is “Information Extraction”

Information Extraction = segmentation + classification + clustering + association

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 9: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 10: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 11: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

NAME

TITLE ORGANIZATION

Bill Gates

CEO

Microsoft

Bill Veghte

VP

Microsoft

Richard Stallman

founder

Free Soft..

*

*

*

*

Page 12: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Larger Context

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Page 13: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Outline

• Examples of IE and Data Mining.

• Brief review of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Piecewise Training for large multi-component models

• Two example projects

– Email, contact management, and Social Network Analysis

– Research Paper search and analysis

Page 14: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Hidden Markov Models

St - 1

St

Ot

St+1

Ot +1

Ot -1

...

...

Finite state model Graphical model

Parameters: for all states S={s1,s2,…} Start state probabilities: P(st ) Transition probabilities: P(st|st-1 ) Observation (emission) probabilities: P(ot|st )Training: Maximize probability of training observations (w/ prior)

∏=

−∝||

11 )|()|(),(

o

ttttt soPssPosP

vvv

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

...transitions

observations

o1 o2 o3 o4 o5 o6 o7 o8

Generates:

State sequenceObservation sequence

Usually a multinomial over atomic, fixed alphabet

Page 15: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE with Hidden Markov Models

Yesterday Rich Caruana spoke this example sentence.

Yesterday Rich Caruana spoke this example sentence.

Person name: Rich Caruana

Given a sequence of observations:

and a trained HMM:

Find the most likely state sequence: (Viterbi)

Any words said to be generated by the designated “person name”state extract as a person name:

person name

location name

background

Page 16: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

(Linear Chain) Conditional Random Fields

yt - 1

yt

xt

yt+1

xt +1

xt -1

Finite state model Graphical model

Undirected graphical model,

trained to maximize conditional probability of outputs given inputs

. . .

FSM states

observations

yt+2

xt +2

yt+3

xt +3

said Veght a Microsoft VP …

p(y | x) =1

Z(x)Φy (y t ,y t−1)Φxy (x t ,y t )

t=1

T

Φ(⋅) = exp λ k fk (⋅)k

∑ ⎛

⎝ ⎜

⎠ ⎟where

OTHER PERSON OTHER ORG TITLE …

output seq

input seq

Asian word segmentation [COLING’04], [ACL’04]IE from Research papers [HTL’04]Object classification in images [CVPR ‘04]

Fast-growing, wide-spread interest, many positive experimental results.

Noun phrase, Named entity [HLT’03], [CoNLL’03]Protein structure prediction [ICML’04]IE from Bioinformatics text [Bioinformatics ‘04],…

[Lafferty, McCallum, Pereira 2001]

Page 17: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Table Extraction from Government ReportsCash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 -------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------- Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :------------------ : : Milk : Milkfat : Milk Produced : Milk : Milkfat -------------------------------------------------------------------------------- : 1,000 Head --- Pounds --- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 --------------------------------------------------------------------------------1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.

Page 18: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

CRFLabels:• Non-Table• Table Title• Table Header• Table Data Row• Table Section Data Row• Table Footnote• ... (12 in all)

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Features:• Percentage of digit chars• Percentage of alpha chars• Indented• Contains 5+ consecutive spaces• Whitespace in this line aligns with prev.• ...• Conjunctions of all previous features,

time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

100+ documents from www.fedstats.gov

Page 19: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Table Extraction Experimental Results

Line labels,percent correct

Table segments,F1

95 % 92 %

65 % 64 %

85 % -

HMM

StatelessMaxEnt

CRF

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Page 20: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE from Research Papers[McCallum et al ‘99]

Page 21: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

IE from Research Papers

Field-level F1

Hidden Markov Models (HMMs) 75.6[Seymore, McCallum, Rosenfeld, 1999]

Support Vector Machines (SVMs) 89.7[Han, Giles, et al, 2003]

Conditional Random Fields (CRFs) 93.9[Peng, McCallum, 2004]

error40%

Page 22: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Named Entity Recognition

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels: Examples:

PER Yayuk BasukiInnocent Butare

ORG 3MKDPCleveland

LOC ClevelandNirmal HridayThe Oval

MISC JavaBasque1,000 Lakes Rally

Page 23: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Named Entity Extraction Results

Method F1

HMMs BBN's Identifinder 73%

CRFs w/out Feature Induction 83%

CRFs with Feature Induction 90%based on LikelihoodGain

[McCallum & Li, 2003, CoNLL]

Page 24: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Outline

• Examples of IE and Data Mining.

• Brief review of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Piecewise Training for large multi-component models

• Two example projects

– Email, contact management, and Social Network Analysis

– Research Paper search and analysis

Page 25: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Larger Context

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Page 26: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Problem:

Combined in serial juxtaposition,IE and DM are unaware of each others’ weaknesses and opportunities.

1) DM begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

2) IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

SegmentClassifyAssociateCluster

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

KnowledgeDiscovery

Actionableknowledge

Page 27: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

Database

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Uncertainty Info

Emerging Patterns

Solution:

Page 28: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

SegmentClassifyAssociateCluster

Filter

Prediction Outlier detection Decision support

IE

Documentcollection

ProbabilisticModel

Discover patterns - entity types - links / relations - events

DataMining

Spider

Actionableknowledge

Solution:

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Geetor…], [Domingos…]

Discriminatively-trained undirected graphical models

Complex Inference and LearningJust what we researchers like to sink our teeth into!

Unified Model

Page 29: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Larger-scale Joint Inference for IE

• What model structures will capture salient dependencies?

• Will joint inference improve accuracy?

• How do to inference in these large graphical models?

• How to efficiently train these models,which are built from multiple large components?

Page 30: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

1. Jointly labeling cascaded sequencesFactorial CRFs

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

[Sutton, Khashayar, McCallum, ICML 2004]

Page 31: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

1. Jointly labeling cascaded sequencesFactorial CRFs

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

[Sutton, Khashayar, McCallum, ICML 2004]

Page 32: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

1. Jointly labeling cascaded sequencesFactorial CRFs

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

[Sutton, Khashayar, McCallum, ICML 2004]

But errors cascade--must be perfect at every stage to do well.

Page 33: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

1. Jointly labeling cascaded sequencesFactorial CRFs

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

[Sutton, Khashayar, McCallum, ICML 2004]

Joint prediction of part-of-speech and noun-phrase in newswire,matching accuracy with only 50% of the training data.

Inference:Tree reparameterization BP

[Wainwright et al, 2002]

Page 34: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

2. Jointly labeling distant mentionsSkip-chain CRFs

Senator Joe Green said today … . Green ran for …

[Sutton, McCallum, SRL 2004]

Dependency among similar, distant mentions ignored.

Page 35: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

2. Jointly labeling distant mentionsSkip-chain CRFs

Senator Joe Green said today … . Green ran for …

[Sutton, McCallum, SRL 2004]

14% reduction in error on most repeated field in email seminar announcements.

Inference:Tree reparameterization BP

[Wainwright et al, 2002]

Page 36: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

3. Joint co-reference among all pairsAffinity Matrix CRF

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

45

99Y/N

Y/N

Y/N

11

[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

~25% reduction in error on co-reference ofproper nouns in newswire.

Inference:Correlational clusteringgraph partitioning

[Bansal, Blum, Chawla, 2002]

“Entity resolution”“Object correspondence”

Page 37: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Coreference Resolution

Input

AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"

Output

News article, with named-entity "mentions" tagged

Number of entities, N = 3

#1 Secretary of State Colin Powell he Mr. Powell Powell

#2 Condoleezza Rice she Rice

#3 President Bush Bush

Today Secretary of State Colin Powell met with . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . he . . . . . .. . . . . . . . . . . . . Condoleezza Rice . . . . .. . . . Mr Powell . . . . . . . . . .she . . . . . . . . . . . . . . . . . . . . . Powell . . . . . . . . . . . .. . . President Bush . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . Rice . . . . . . . . . . . . . . . . Bush . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . .

Page 38: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Inside the Traditional Solution

Mention (3) Mention (4)

. . . Mr Powell . . . . . . Powell . . .

N Two words in common 29Y One word in common 13Y "Normalized" mentions are string identical 39Y Capitalized word in common 17Y > 50% character tri-gram overlap 19N < 25% character tri-gram overlap -34Y In same sentence 9Y Within two sentences 8N Further than 3 sentences apart -1Y "Hobbs Distance" < 3 11N Number of entities in between two mentions = 0 12N Number of entities in between two mentions > 4 -3Y Font matches 1Y Default -19

OVERALL SCORE = 98 > threshold=0

Pair-wise Affinity Metric

Y/N?

Page 39: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

The Problem

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

affinity = 98

affinity = 11

affinity = 104

Pair-wise mergingdecisions are beingmade independentlyfrom each otherY

Y

N

Affinity measures are noisy and imperfect.

They should be madein relational dependencewith each other.

Page 40: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

A Markov Random Field for Co-reference

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

45

30Y/N

Y/N

Y/N

[McCallum & Wellner, 2003, ICML](MRF)

Make pair-wise mergingdecisions in dependent relation to each other by- calculating a joint prob.- including all edge weights- adding dependence on consistent triangles.

11

P(v y |

v x ) =

1

Z v x

exp λ l f l (x i, x j , y ij ) + λ ' f '(y ij , y jk,y ik )i, j,k

∑l

∑i, j

∑ ⎛

⎝ ⎜ ⎜

⎠ ⎟ ⎟

Page 41: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

A Markov Random Field for Co-reference

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

45

30Y/N

Y/N

Y/N

[McCallum & Wellner, 2003](MRF)

Make pair-wise mergingdecisions in dependent relation to each other by- calculating a joint prob.- including all edge weights- adding dependence on consistent triangles.

11 ∞−

P(v y |

v x ) =

1

Z v x

exp λ l f l (x i, x j , y ij ) + λ ' f '(y ij , y jk,y ik )i, j,k

∑l

∑i, j

∑ ⎛

⎝ ⎜ ⎜

⎠ ⎟ ⎟

Page 42: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

A Markov Random Field for Co-reference

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

Y

N

N

[McCallum & Wellner, 2003]

P(v y |

v x ) =

1

Z v x

exp λ l f l (x i, x j , y ij ) + λ ' f '(y ij , y jk,y ik )i, j,k

∑l

∑i, j

∑ ⎛

⎝ ⎜ ⎜

⎠ ⎟ ⎟

(MRF)

4

45)

30)

(11)

Page 43: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

A Markov Random Field for Co-reference

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

Y

Y

N

[McCallum & Wellner, 2003]

P(v y |

v x ) =

1

Z v x

exp λ l f l (x i, x j , y ij ) + λ ' f '(y ij , y jk,y ik )i, j,k

∑l

∑i, j

∑ ⎛

⎝ ⎜ ⎜

⎠ ⎟ ⎟

(MRF)

infinity

45)

30)

(11)

Page 44: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

A Markov Random Field for Co-reference

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

N

Y

N

[McCallum & Wellner, 2003]

P(v y |

v x ) =

1

Z v x

exp λ l f l (x i, x j , y ij ) + λ ' f '(y ij , y jk,y ik )i, j,k

∑l

∑i, j

∑ ⎛

⎝ ⎜ ⎜

⎠ ⎟ ⎟

(MRF)

64

45)

30)

(11)

Page 45: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Inference in these MRFs = Graph Partitioning[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

45

11

30

. . . Condoleezza Rice . . .

134

10

log P(v y |

v x )( )∝ λ l f l (x i,x j , y ij )

l

∑i, j

∑ = w ij

i, j w/inparitions

∑ − w ij

i, j acrossparitions

106

Page 46: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Inference in these MRFs = Graph Partitioning[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . . . . . Condoleezza Rice . . .

= 22

45

11

30

134

10

106

log P(v y |

v x )( )∝ λ l f l (x i,x j , y ij )

l

∑i, j

∑ = w ij

i, j w/inparitions

∑ − w ij

i, j acrossparitions

Page 47: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Inference in these MRFs = Graph Partitioning[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . . . . . Condoleezza Rice . . .

log P(v y |

v x )( )∝ λ l f l (x i, x j , y ij )

l

∑i, j

∑ = w ij

i, j w/inparitions

∑ + w'iji, j acrossparitions

∑ = 314

45

11

30

134

10

106

Page 48: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Co-reference Experimental Results

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories

Partition F1 Pair F1Single-link threshold 16 % 18 %Best prev match [Morton] 83 % 89 %MRFs 88 % 92 %

error=30% error=28%

DARPA MUC-6 newswire article corpus, 30 stories

Partition F1 Pair F1Single-link threshold 11% 7 %Best prev match [Morton] 70 % 76 %MRFs 74 % 80 %

error=13% error=17%

[McCallum & Wellner, 2003]

Page 49: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Y/N

Y/N

Y/N

Y/N

Y/N

Y/N

Joint Co-reference of Multiple Fields

X. LiPredicting the Stock Market

X. LiPredicting the Stock Market

International Conference onKnowledge Discovery and Data Mining

[Culotta & McCallum 2005]

Xuerui LiPredict the Stock Market

KDD

KDD

Citations Venues

Page 50: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Joint Co-reference Experimental Results

CiteSeer Dataset1500 citations, 900 unique papers, 350 unique venues

Paper Venueindep joint indep joint

constraint 88.9 91.0 79.4 94.1reinforce 92.2 92.2 56.5 60.1face 88.2 93.7 80.9 82.8reason 97.4 97.0 75.6 79.5Micro Average 91.7 93.4 73.1 79.1

error=20% error=22%

[Culotta & McCallum 2005]

Page 51: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Joint co-reference among all pairsAffinity Matrix CRF

. . . Mr Powell . . .

. . . Powell . . .

. . . she . . .

45

99Y/N

Y/N

Y/N

11

[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

~25% reduction in error on co-reference ofproper nouns in newswire.

Inference:Correlational clusteringgraph partitioning

[Bansal, Blum, Chawla, 2002]

Page 52: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

p

Databasefield values

c

4. Joint segmentation and co-reference

o

s

o

s

c

c

s

o

Citation attributes

y y

y

Segmentation

[Wellner, McCallum, Peng, Hay, UAI 2004]Inference:Variant of Iterated Conditional Modes

Co-reference decisions

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

[Besag, 1986]

World Knowledge

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.

Extraction from and matching of research paper citations.

see also [Marthi, Milch, Russell, 2003]

Page 53: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Joint IE and Coreference from Research Paper Citations

Textual citation mentions(noisy, with duplicates)

Paper database, with fields,clean, duplicates collapsed

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

AUTHORS TITLE VENUECowell, Dawid… Probab… SpringerMontemerlo, Thrun…FastSLAM… AAAI…Kjaerulff Approxi… Technic…

4. Joint segmentation and co-reference

Page 54: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

Citation Segmentation and Coreference

Page 55: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

Citation Segmentation and Coreference

Page 56: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

Citation Segmentation and Coreference

Y?N

Page 57: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

3) Form canonical database record

Citation Segmentation and Coreference

AUTHOR = Brenda Laurel TITLE = Interface Agents: Metaphors with CharacterPAGES = 355-366BOOKTITLE = The Art of Human-Computer Interface DesignEDITOR = T. SmithPUBLISHER = Addison-WesleyYEAR = 1990

Y?N

Resolving conflicts

Page 58: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

3) Form canonical database record

Citation Segmentation and Coreference

AUTHOR = Brenda Laurel TITLE = Interface Agents: Metaphors with CharacterPAGES = 355-366BOOKTITLE = The Art of Human-Computer Interface DesignEDITOR = T. SmithPUBLISHER = Addison-WesleyYEAR = 1990

Y?N

Perform jointly.

Page 59: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

x

s

Observed citation

CRF Segmentation

IE + Coreference Model

J Besag 1986 On the…

AUT AUT YR TITL TITL

Page 60: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

x

s

Observed citation

CRF Segmentation

IE + Coreference Model

Citation mention attributes

J Besag 1986 On the…

AUTHOR = “J Besag”YEAR = “1986”TITLE = “On the…”

c

Page 61: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

x

s

IE + Coreference Model

c

J Besag 1986 On the…Smyth . 2001 Data Mining…

Smyth , P Data mining…

Structure for each citation mention

Page 62: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

x

s

IE + Coreference Model

c

Binary coreference variablesfor each pair of mentions

J Besag 1986 On the…Smyth . 2001 Data Mining…

Smyth , P Data mining…

Page 63: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

x

s

IE + Coreference Model

c

y n

n

J Besag 1986 On the…Smyth . 2001 Data Mining…

Smyth , P Data mining…

Binary coreference variablesfor each pair of mentions

Page 64: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

y n

n

x

s

IE + Coreference Model

c

J Besag 1986 On the…Smyth . 2001 Data Mining…

Smyth , P Data mining…

Research paper entity attribute nodes

AUTHOR = “P Smyth”YEAR = “2001”TITLE = “Data Mining…”...

Page 65: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Such a highly connected graph makes exact inference intractable…

Page 66: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

y n

n

Parameter Estimation

Coref graph edge weightsMAP on individual edges

Separately for different regions

IE Linear-chainExact MAP

Entity attribute potentialsMAP, pseudo-likelihood

In all cases:Climb MAP gradient with

quasi-Newton method

Page 67: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Outline

• Examples of IE and Data Mining.

• Brief review of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Piecewise Training for large multi-component models

• Two example projects

– Email, contact management, and Social Network Analysis

– Research Paper search and analysis

Page 68: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training

Page 69: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training with NOTA

Page 70: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Experimental Results

Named entity tagging (CoNLL-2003)Training set = 15k newswire sentences9 labels

Test F1 Training time

CRF 89.87 9 hours

MEMM 88.90 1 hour

CRF-PT 5.3 hours90.50stat. sig. improvement atp = 0.001

Page 71: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Experimental Results 2

Part-of-speech tagging (Penn Treebank, small subset)Training set = 1154 newswire sentences45 labels

Test F1 Training time

CRF 88.1 14 hours

MEMM 88.1 2 hours

CRF-PT 2.5 hours88.8stat. sig. improvement atp = 0.001

Page 72: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

“Parameter Independence Diagrams”

Graphical models = formalism for representing independence assumptions among variables.

Here we representindependence assumptions among parameters (in factor graph)

Page 73: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training Research Questions

• How to select the boundaries of “pieces”?• What choices of limited interaction are best?• How to sample sparse subsets of NOTA instances?

• Application to simpler models (classifiers)• Application to more complex models (parsing)

Page 74: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training in Factorial CRFsfor Transfer Learning

Emailed seminar ann’mt entities

Email English words

[Sutton, McCallum, 2005]

Too little labeled training data.

60k words training. GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Page 75: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training in Factorial CRFsfor Transfer Learning

Newswire named entities

Newswire English words

[Sutton, McCallum, 2005]

Train on “related” task with more data.

200k words training.

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Page 76: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training in Factorial CRFsfor Transfer Learning

Newswire named entities

Email English words

[Sutton, McCallum, 2005]

At test time, label email with newswire NEs...

Page 77: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training in Factorial CRFsfor Transfer Learning

Newswire named entities

Emailed seminar ann’mt entities

Email English words

[Sutton, McCallum, 2005]

…then use these labels as features for final task

Page 78: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Piecewise Training in Factorial CRFsfor Transfer Learning

Newswire named entities

Seminar Announcement entities

English words

[Sutton, McCallum, 2005]

Piecewise training of a joint model.

Page 79: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

CRF Transfer Experimental Results

Seminar Announcements Dataset [Freitag 1998]

CRF stime etime location speaker overall

No transfer 99.1 97.3 81.0 73.7 87.8

Cascaded transfer 99.2 96.0 84.3 74.2 88.4

Joint transfer 99.1 96.0 85.3 76.3 89.2

New “best published”accuracy on commondataset

[Sutton, McCallum, 2005]

Page 80: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Outline

• Examples of IE and Data Mining.

• Brief review of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Piecewise Training for large multi-component models

• Two example projects

– Email, contact management, and Social Network Analysis

– Research Paper search and analysis

Page 81: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Workplace effectiveness ~ Ability to leverage network of acquaintances

But filling Contacts DB by hand is tedious, and incomplete.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Email Inbox Contacts DB

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

WWW

Automatically

Managing and UnderstandingConnections of People in our Email World

Page 82: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

System Overview

ContactInfo andPerson Name

Extraction

Person Name

Extraction

NameCoreference

HomepageRetrieval

Social NetworkAnalysis

KeywordExtraction

CRFWWW

names

Email QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Page 83: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

An ExampleTo: “Andrew McCallum” [email protected]

Subject ...

First Name:

Andrew

Middle Name:

Kachites

Last Name:

McCallum

JobTitle: Associate Professor

Company: University of Massachusetts

Street Address:

140 Governor’s Dr.

City: Amherst

State: MA

Zip: 01003

Company Phone:

(413) 545-1323

Links: Fernando Pereira, Sam Roweis,…

Key Words:

Information extraction,

social network,…

Search for new people

Page 84: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Summary of Results

Token

Acc

Field

Prec

Field

Recall

Field

F1

CRF 94.50 85.73 76.33 80.76

Person Keywords

William Cohen Logic programming

Text categorization

Data integration

Rule learning

Daphne Koller Bayesian networks

Relational models

Probabilistic models

Hidden variables

Deborah McGuiness

Semantic web

Description logics

Knowledge representation

Ontologies

Tom Mitchell Machine learning

Cognitive states

Learning apprentice

Artificial intelligence

Contact info and name extraction performance (25 fields)

Example keywords extracted

1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)

2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 85: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

From LDA to Author-Recipient-Topic(ART)

Page 86: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Inference and Estimation

Gibbs Sampling:- Easy to implement- Reasonably fast

r

Page 87: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Enron Email Corpus

• 250k email messages• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)From: [email protected]: [email protected]: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra PerlingiereEnron North America Corp.Legal Department1400 Smith Street, EB 3885Houston, Texas [email protected]

Page 88: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Topics, and prominent sender/receiversdiscovered by ART

Page 89: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Topics, and prominent sender/receiversdiscovered by ART

Beck = “Chief Operations Officer”Dasovich = “Government Relations Executive”Shapiro = “Vice Presidence of Regulatory Affairs”Steffes = “Vice President of Government Affairs”

Page 90: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Comparing Role Discovery

connection strength (A,B) =

distribution overauthored topics

Traditional SNA

distribution overrecipients

distribution overauthored topics

Author-TopicART

Page 91: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Comparing Role Discovery Tracy Geaconne Dan McCarty

Traditional SNA Author-TopicART

Similar roles Different rolesDifferent roles

Geaconne = “Secretary”McCarty = “Vice President”

Page 92: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Traditional SNA Author-TopicART

Different roles Very similarNot very similar

Geaconne = “Secretary”Hayslett = “Vice President & CTO”

Comparing Role Discovery Tracy Geaconne Rod Hayslett

Page 93: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Traditional SNA Author-TopicART

Different roles Very differentVery similar

Blair = “Gas pipeline logistics”Watson = “Pipeline facilities planning”

Comparing Role Discovery Lynn Blair Kimberly Watson

Page 94: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

McCallum Email Corpus 2004

• January - October 2004• 23k email messages• 825 people

From: [email protected]: NIPS and ....Date: June 14, 2004 2:27:41 PM EDTTo: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.CALO registration receipt.

Thanks,Kate

Page 95: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

McCallum Email Blockstructure

Page 96: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Four most prominent topicsin discussions with ____?

Page 97: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 98: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Two most prominent topicsin discussions with ____?

Words Problove 0.030514house 0.015402

0.013659time 0.012351great 0.011334hope 0.011043dinner 0.00959saturday 0.009154left 0.009154ll 0.009009

0.008282visit 0.008137evening 0.008137stay 0.007847bring 0.007701weekend 0.007411road 0.00712sunday 0.006829kids 0.006539flight 0.006539

Words Probtoday 0.051152tomorrow 0.045393time 0.041289ll 0.039145meeting 0.033877week 0.025484talk 0.024626meet 0.023279morning 0.022789monday 0.020767back 0.019358call 0.016418free 0.015621home 0.013967won 0.013783day 0.01311hope 0.012987leave 0.012987office 0.012742tuesday 0.012558

Page 99: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Topic 40

Words Prob Sender Recipient Probcode 0.060565 hough mccallum 0.067076mallet 0.042015 mccallum hough 0.041032files 0.029115 mikem mccallum 0.028501al 0.024201 culotta mccallum 0.026044file 0.023587 saunders mccallum 0.021376version 0.022113 mccallum saunders 0.019656java 0.021499 pereira mccallum 0.017813test 0.020025 casutton mccallum 0.017199problem 0.018305 mccallum ronb 0.013514run 0.015356 mccallum pereira 0.013145cvs 0.013391 hough melinda.gervasio 0.013022add 0.012776 mccallum casutton 0.013022directory 0.012408 fuchun mccallum 0.010811release 0.012285 mccallum culotta 0.009828output 0.011916 ronb mccallum 0.009705bug 0.011179 westy hough 0.009214source 0.010197 xuerui corrada 0.0086ps 0.009705 ghuang mccallum 0.008354log 0.008968 khash mccallum 0.008231created 0.0086 melinda.gervasiomccallum 0.008108

Page 100: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 101: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Role-Author-Recipient-Topic Models

Page 102: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Outline

• Examples of IE and Data Mining.

• Brief review of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Piecewise Training for large multi-component models

• Two example projects

– Email, contact management, and Social Network Analysis

– Research Paper search and analysis

Page 103: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Previous Systems

Page 104: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 105: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

ResearchPaper

Cites

Previous Systems

Page 106: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

ResearchPaper

Cites

Person

UniversityVenue

Grant

Groups

Expertise

More Entities and Relations

Page 107: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 108: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 109: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 110: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 111: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 112: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 113: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst
Page 114: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

Summary

• Conditional Random Fields– Conditional probability models of structured data

• Data mining complex unstructured text suggests the need for joint inference IE + DM.

• Early examples– Factorial finite state models– Jointly labeling distant entities– Coreference analysis– Segmentation uncertainty aiding coreference

• Piecewise Training– Faster + higher accuracy

• Current projects– Email, contact management, expert-finding, SNA– Mining the scientific literature

Page 115: Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst

End of Talk