Probabilistic Models of Relational Data
Daphne Koller, Stanford University
Joint work with:
Lise Getoor
Ming-Fai Wong
Eran Segal
Avi Pfeffer
Pieter Abbeel
Nir Friedman
Ben Taskar
Why Relational?
The real world is composed of objects that have properties and are related to each other
Natural language is all about objects and how they relate to each other: "George got an A in Geography 101"
Attribute-Based Worlds
Smart students get A’s in easy classes
Smart_Jane & easy_CS101 → GetA_Jane_CS101
Smart_Mike & easy_Geo101 → GetA_Mike_Geo101
Smart_Jane & easy_Geo101 → GetA_Jane_Geo101
Smart_Rick & easy_CS221 → GetA_Rick_CS221
World = assignment of values to attributes / truth values to propositional symbols
Object-Relational Worlds
World = relational interpretation:
  Objects in the domain
  Properties of these objects
  Relations (links) between objects
∀x,y (Smart(x) & Easy(y) & Take(x,y) → Grade(A,x,y))
Why Probabilities?
All universals are false (almost):
  "Smart students get A's in easy classes"
True universals are rarely useful:
  "Smart students get either A, B, C, D, or F"

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities …"
James Clerk Maxwell
Probable Worlds

Probabilistic semantics:
  A set of possible worlds
  Each world associated with a probability
[Figure: the twelve possible worlds, one for each combination of course difficulty (easy / hard), student intelligence (smart / weak), and grade (A / B / C), each associated with a probability]
Representation: Design Axes
Two axes: the world state (attributes, sequences, objects) and the epistemic state (categorical, probabilistic):

                 Attributes                   Sequences                          Objects
Categorical      Propositional logic, CSPs    Automata, Grammars                 First-order logic, Relational databases
Probabilistic    Bayesian nets, Markov nets   n-gram models, HMMs, Prob. CFGs    (the gap this talk fills: PRMs)
Outline
Bayesian Networks
  Representation & Semantics
  Reasoning
Probabilistic Relational Models
Collective Classification
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Bayesian Networks
nodes = variables; edges = direct influence
Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade
[Figure: BN with nodes Difficulty, Intelligence, Grade, Letter, SAT; bar chart of the CPD P(G|D,I), giving the distribution over grades A/B/C for each of (hard,high), (hard,low), (easy,high), (easy,low)]
BN semantics
Compact & natural representation:
  nodes with at most k parents: 2^k·n vs. 2^n parameters
  parameters natural and easy to elicit

conditional independencies in BN structure + local probability models = full joint distribution over domain

P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)

Full joint distribution specifies the answer to any query: P(variable | evidence about others)
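The factored joint of the student network can answer any such query by summing out the other variables. A minimal Python sketch, with illustrative (made-up) CPD values, not numbers from the talk:

```python
import itertools

# Student BN: Difficulty (D), Intelligence (I), Grade (G), Letter (L), SAT (S).
# All CPD numbers below are invented for illustration.
P_D = {0: 0.6, 1: 0.4}                      # 0 = easy, 1 = hard
P_I = {0: 0.7, 1: 0.3}                      # 0 = low,  1 = high
P_G = {                                     # P(Grade | D, I); grades: 0=A, 1=B, 2=C
    (0, 0): [0.3, 0.4, 0.3],
    (0, 1): [0.9, 0.08, 0.02],
    (1, 0): [0.05, 0.25, 0.7],
    (1, 1): [0.5, 0.3, 0.2],
}
P_L = {0: 0.9, 1: 0.6, 2: 0.01}             # P(Letter = yes | Grade)
P_S = {0: 0.05, 1: 0.8}                     # P(SAT = high | Intelligence)

def joint(d, i, g, l, s):
    """P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)."""
    pl = P_L[g] if l else 1 - P_L[g]
    ps = P_S[i] if s else 1 - P_S[i]
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * pl * ps

def query(var_index, evidence):
    """P(variable | evidence) by summing the full joint over all worlds."""
    dist = {}
    for world in itertools.product([0, 1], [0, 1], [0, 1, 2], [0, 1], [0, 1]):
        if all(world[k] == v for k, v in evidence.items()):
            dist[world[var_index]] = dist.get(world[var_index], 0.0) + joint(*world)
    z = sum(dist.values())
    return {val: p / z for val, p in dist.items()}

# P(Intelligence | Letter = yes): observing a letter raises belief in high intelligence
print(query(1, {3: 1}))
```

Enumeration is exponential in the number of variables; it is shown here only to make the semantics concrete.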
Reasoning using BNs
[Figure: BN over Difficulty, Intelligence, Grade, Letter, SAT, with evidence entered at Letter and SAT]

"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon Laplace
BN Inference

BN inference is NP-hard in general
Inference can exploit graph structure:
  Graph separation → conditional independence
  Do separate inference in parts
  Combine results over the interface
Complexity: exponential in the largest separator
Structured BNs allow effective inference; exact inference in dense BNs is intractable
Approximate BN Inference

Belief propagation is an iterative message-passing algorithm for approximate inference in BNs
Each iteration (until "convergence"): nodes pass "beliefs" as messages to neighboring nodes
Pros: linear time per iteration; works very well in practice, even for dense networks
Cons: limited theoretical guarantees; might not converge
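The message-passing loop can be sketched as sum-product loopy BP on a tiny pairwise Markov network (a 3-node cycle of binary variables). All potentials below are made up for illustration:

```python
# Sum-product loopy belief propagation on a 3-node cycle of binary variables.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]
phi_node = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [2.0, 1.0]}   # unary potentials
phi_edge = [[2.0, 1.0], [1.0, 2.0]]                        # favors agreement

# msg[(i, j)][x_j]: message from i to j, initialized uniform
msg = {(i, j): [1.0, 1.0] for i, j in edges + [(j, i) for i, j in edges]}

def neighbors(i):
    return [j for a, b in edges for j in ((b,) if a == i else (a,) if b == i else ())]

for _ in range(50):                                        # iterate until (near) convergence
    new = {}
    for (i, j) in msg:
        m = [0.0, 0.0]
        for xj in (0, 1):
            for xi in (0, 1):
                incoming = 1.0
                for k in neighbors(i):
                    if k != j:
                        incoming *= msg[(k, i)][xi]        # product of other incoming messages
                m[xj] += phi_node[i][xi] * phi_edge[xi][xj] * incoming
        z = sum(m)
        new[(i, j)] = [v / z for v in m]                   # normalize for numerical stability
    msg = new

def belief(i):
    """Approximate marginal: unary potential times all incoming messages."""
    b = [phi_node[i][x] for x in (0, 1)]
    for k in neighbors(i):
        b = [b[x] * msg[(k, i)][x] for x in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

print([belief(i) for i in nodes])
```

On this small cycle the beliefs land close to the exact marginals; on trees BP is exact, and on loopy graphs it is the approximation the slide describes.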
Outline
Bayesian Networks
Probabilistic Relational Models
  Language & Semantics
  Web of Influence
Collective Classification
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Bayesian Networks: Problem
Bayesian nets use a propositional representation, but the real world has objects related to each other

[Figure: the template fragment (Intelligence, Difficulty → Grade) instantiated for (Jane, CS101), (George, Geo101), and (George, CS101)]

These "instances" are not independent
Probabilistic Relational Models
Combine advantages of relational logic & BNs:
  Natural domain modeling: objects, properties, relations
  Generalization over a variety of situations
  Compact, natural probability models
Integrate uncertainty with relational model:
  Properties of domain entities can depend on properties of related entities
  Uncertainty over relational structure of domain
St. Nordaf University
[Figure: an example world; Prof. Smith and Prof. Jones (Teaching-ability) teach Welcome to CS101 and Welcome to Geo101 (Difficulty); George and Jane (Intelligence) are registered in these courses, each registration carrying a Grade and Satisfaction]
Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects

Classes & attributes:
  Professor: Teaching-Ability
  Course: Difficulty
  Student: Intelligence
  Registration: Grade, Satisfaction
Relations:
  Teach (Professor, Course)
  In (Registration, Course)
  Take (Student, Registration)
Probabilistic Relational Models
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions

[Figure: PRM dependency structure over Professor.Teaching-Ability, Course.Difficulty, Student.Intelligence, Reg.Grade, and Reg.Satisfaction]
[K. & Pfeffer; Poole; Ngo & Haddawy]
PRM Semantics

Instantiated PRM → BN:
  variables: attributes of all objects
  dependencies: determined by links & PRM

[Figure: the PRM, with its CPD P(G|D,I), unrolled over Prof. Smith and Prof. Jones (Teaching-ability), Welcome to CS101 and Welcome to Geo101 (Difficulty), and George and Jane (Intelligence: low / high), yielding one BN variable per attribute of each object]
The Web of Influence
[Figure: observed grades (A's and C's) in Welcome to Geo101 shift posterior beliefs about course difficulty (easy / hard) and student intelligence (low / high) throughout the network]
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
  Learning models from data
  Collective classification of webpages
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Learning PRMs
[Figure: a Learner takes as input a relational database (Course, Student, Reg tables) and expert knowledge, and outputs a PRM]
[Friedman, Getoor, K., Pfeffer]
Learning PRMs

Parameter estimation:
  Probabilistic model with shared parameters: grades for all students share the same model
  Can use standard techniques for maximum-likelihood or Bayesian parameter estimation
Structure learning:
  Define a scoring function over structures
  Use combinatorial search to find a high-scoring structure
P̂(Reg.Grade = A | Student.Intell = hi, Course.Diff = lo) =
    #(Reg.Grade = A, Student.Intell = hi, Course.Diff = lo)
    / #(Reg.Grade = *, Student.Intell = hi, Course.Diff = lo)
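The shared-parameter maximum-likelihood estimate is just a ratio of counts pooled across all registrations. A toy Python sketch; the tiny relational "database" is made up for illustration:

```python
from collections import Counter

# Toy relational data (invented): one CPD P(Grade | Intelligence, Difficulty)
# is shared by every Registration object.
students = {"Jane": "hi", "George": "lo", "Alice": "hi"}   # Student.Intelligence
courses  = {"CS101": "lo", "Geo101": "hi"}                 # Course.Difficulty
regs = [("Jane",   "CS101",  "A"), ("Jane",   "Geo101", "B"),
        ("George", "CS101",  "B"), ("George", "Geo101", "C"),
        ("Alice",  "CS101",  "B"), ("Alice",  "Geo101", "A")]

# Pool sufficient statistics across all registrations
counts = Counter()
for s, c, g in regs:
    counts[(students[s], courses[c], g)] += 1

def mle(grade, intell, diff):
    """P^(Reg.Grade = grade | Student.Intelligence = intell, Course.Difficulty = diff)."""
    den = sum(counts[(intell, diff, g)] for g in "ABC")
    return counts[(intell, diff, grade)] / den if den else 0.0

print(mle("A", "hi", "lo"))   # fraction of (hi, lo) registrations that got an A
```

Because every registration shares the same CPD, even a handful of students and courses can contribute many observations to each parameter.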
Web KB
[Figure: fragment of the WebKB domain; Tom Mitchell (Professor), WebKB Project, Sean Slattery (Student), connected by Advisor-of, Project-of, and Member links]
[Craven et al.]
Web Classification Experiments
WebKB dataset:
  Four CS department websites
  Bag of words on each page
  Links between pages
  Anchor text for links
Experimental setup:
  Trained on three universities, tested on the fourth
  Repeated for all four combinations

[Example page fragment: "Professor … department … extract information … computer science … machine learning …"]
Standard Classification
Categories: faculty, course, project, student, other

Naïve Bayes model: Page.Category is the parent of Word1 … WordN

[Chart: classification error, words-only baseline]
Exploiting Links
Incoming anchor text (e.g., "working with Tom Mitchell …") supplies LinkWords in addition to page words

Model: Page.Category is the parent of Word1 … WordN and of the LinkWords

[Chart: classification error, words only vs. link words]
Collective Classification
[Figure: collective model; for each candidate link, Link.Exists depends on the Categories of the From-Page and To-Page, each page with its Words]
[Getoor, Segal, Taskar, Koller]
Approximate inference: belief propagation

[Chart: classification error, words only vs. link words vs. collective]

Classify all pages collectively, maximizing the joint label probability
P(Registration.Grade | Course.Difficulty, Student.Intelligence)
[Chart: the CPD P(Grade | Difficulty, Intelligence) over (hard,high), (hard,low), (easy,high), (easy,low)]
Learning w. Missing Data: EM
[Figure: EM iteration; alternately complete the unobserved values (students' intelligence low / high, courses' difficulty easy / hard) given the observed grades A/B/C, and re-estimate the shared CPD P(Grade | Difficulty, Intelligence)]
[Dempster et al. 77]
Discovering Hidden Types
Internet Movie Database (http://www.imdb.com)

[Figure: schema with classes Actor, Director, and Movie, where Movie has attributes Genres, Rating, Year, #Votes, MPAA Rating]
Discovering Hidden Types
[Figure: a hidden Type attribute is added to each class (Actor, Director, Movie)]
[Taskar, Segal, Koller]
Directors (two discovered types):
  Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
  Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors (two discovered types):
  Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger
  …
Movies (discovered types include):
  Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
  …
  Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
  Markov Networks
  Relational Markov Networks
Collective Classification Revisited
PRMs for NLP
Directed Models: Limitations

Acyclicity constraint limits expressive power: e.g., two objects linked to by a student are probably not both professors
Acyclicity forces modeling of all potential links: network size O(N²), so inference is quadratic
Generative training: trained to fit all of the data, not to maximize accuracy

Solution: Undirected Models

Allow arbitrary patterns over sets of objects & links
Influence flows over existing links, exploiting link-graph sparsity: network size O(N)
Allow discriminative training: maximize P(labels | observations)
[Lafferty, McCallum, Pereira]
Markov Networks
Graph structure encodes independence assumptions: Chris conditionally independent of Eve given Alice & Dave

[Figure: Markov network over Alice, Betty, Chris, Dave, Eve]

P(A,B,C,D,E) = (1/Z) φ(A,B,C) φ(C,D) φ(D,E) φ(E,A)

[Chart: compatibility potential φ(A,B,C) over the eight truth assignments TTT … FFF, values ranging from 0 to 2]
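Markov-network semantics can be sketched directly: multiply the clique potentials and normalize by the partition function Z. The potential values below are illustrative, not from the talk:

```python
import itertools

# P(a,b,c,d,e) = (1/Z) * phi1(a,b,c) * phi2(c,d) * phi3(d,e) * phi4(e,a)
# Potentials reward agreement among the five binary variables (made-up numbers).
def phi1(a, b, c): return 2.0 if a == b == c else 1.0
def phi2(c, d):    return 3.0 if c == d else 1.0
def phi3(d, e):    return 3.0 if d == e else 1.0
def phi4(e, a):    return 3.0 if e == a else 1.0

def unnorm(a, b, c, d, e):
    """Unnormalized measure: product of clique potentials."""
    return phi1(a, b, c) * phi2(c, d) * phi3(d, e) * phi4(e, a)

# Partition function: sum over all 32 joint assignments
Z = sum(unnorm(*w) for w in itertools.product((0, 1), repeat=5))

def P(*w):
    return unnorm(*w) / Z

# sanity check: the 32 world probabilities sum to 1
print(sum(P(*w) for w in itertools.product((0, 1), repeat=5)))
```

Note that Z couples all the variables, which is exactly why learning and inference in undirected models are harder than reading off CPDs in a BN.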
Relational Markov Networks
Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions

[Figure: RMN template; two students in the same Study Group, each with Intelligence and a Reg.Grade in a Course with Difficulty]
[Taskar, Abbeel, Koller ‘02]
[Chart: template potential over grade pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]
RMN Semantics

Instantiated RMN → MN:
  variables: attributes of all objects
  dependencies: determined by links & RMN

[Figure: the template unrolled over George, Jane, and Jill (Intelligence) with Grades in Welcome to CS101 and Welcome to Geo101 (Difficulty), coupled through the CS and Geo study groups]
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited
  Discriminative training of RMNs
  Webpage classification
  Link prediction
PRMs for NLP
Learning RMNs

Parameter estimation is not closed form
Convex problem → unique global maximum

Maximize L = log P(Grades, Intelligence | Difficulty)

∂L/∂λ(Grade=A, Grade=A) = #(Grade=A, Grade=A) − E[#(Grade=A, Grade=A)]
(empirical feature count minus expected feature count)

[Figure: gradient updates to the template potential over grade pairs (AA … CC), instantiated over the students' Intelligence (low / high) and Grades (A/B/C) given course Difficulty (easy / hard)]
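The same "empirical minus expected feature counts" gradient drives discriminative training in the simplest log-linear case, logistic regression. A self-contained sketch on synthetic data (features and labels invented for illustration):

```python
import math

# Gradient ascent on the conditional log-likelihood of a log-linear model.
# Each example: features [bias, x], label 1 iff x > 0 (synthetic, separable).
data = [([1.0, x], 1 if x > 0 else 0) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]

w = [0.0, 0.0]
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0]
    for feats, y in data:
        # model probability P(y=1 | feats) under current weights
        p = 1.0 / (1.0 + math.exp(-sum(wi * f for wi, f in zip(w, feats))))
        for k, f in enumerate(feats):
            grad[k] += (y - p) * f      # empirical minus expected feature count
    w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]

print(w)   # w[1] grows positive: larger x pushes toward label 1
```

Because the conditional log-likelihood is concave, this ascent has a unique global maximum, the property the slide claims for RMN training (where the expected counts are computed by inference rather than a single sigmoid).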
Flat Models

Logistic regression: P(Category | Words), over page words and link words

[Chart: classification error, Naïve Bayes vs. Logistic vs. SVM]
Exploiting Links
[Figure: RMN model; each link couples the Categories of the From-Page and To-Page, each page with its Words]

[Chart: classification error, PRM vs. Logistic vs. RMN-link]

42.1% relative reduction in error compared to the generative approach
More Complex Structure
[Figure: richer relational patterns; a page's Category and Words together with related Faculty, Students, and Courses]
Collective Classification: Results
[Chart: classification error, Logistic vs. Links vs. Section vs. Link+Section]

35.4% relative reduction in error compared to the strong flat approach
Scalability

WebKB data set size: 1300 entities, 180K attributes, 5800 links

Network size per school:
  Directed model: 200,000 variables, 360,000 edges
  Undirected model: 40,000 variables, 44,000 edges

                      Training    Classification
  Directed models     3 sec       20 minutes
  Undirected models   180 sec     15-20 sec

The difference in training time decreases substantially when some training data is unobserved (want to model with hidden variables)
Predicting Relationships
Even more interesting are the relationships between objects: e.g., verbs are almost always relationships

[Figure: Tom Mitchell (Professor), WebKB Project, Sean Slattery (Student), with Advisor-of and Member links; the task is to predict the relation type (Rel) of each link]
Flat Model

[Figure: predict each link's Type (NONE, advisor, instructor, TA, member, project-of) from the From-Page words, To-Page words, and link words]
Collective Classification: Links
[Figure: link model; the link Type depends on the link words and on the Categories of the From-Page and To-Page]
Triad Model

[Figure: triad templates; e.g., a Professor who is Advisor of a Student, with both Members of the same Group, or the Professor as Instructor and the Student as TA of the same Course]
WebKB++

Four new department web sites: Berkeley, CMU, MIT, Stanford
Labeled page type (8 types): faculty, student, research scientist, staff, research group, research project, course, organization
Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE
Data set size: 11K pages, 110K links, 2 million words
Link Prediction: Results
Error measured over links predicted to be present
Link-presence cutoff is at the precision/recall break-even point (30% for all models)

[Chart: link prediction error, Flat vs. Labels vs. Triad]

72.9% relative reduction in error compared to the strong flat approach
Summary
PRMs inherit key advantages of probabilistic graphical models:
  Coherent probabilistic semantics
  Exploit structure of local interactions
Relational models are inherently more expressive
"Web of influence": use all available information to reach powerful conclusions
Exploit both relational information and the power of probabilistic reasoning
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP, or "Why Should I Care?"*
  Word-Sense Disambiguation
  Relation Extraction
  Natural Language Understanding (?)

* An outsider's perspective
Word Sense Disambiguation

"Her advisor gave her feedback about the draft."

Neighboring words alone may not provide enough information to disambiguate
We can gain insight by considering compatibility between senses of related words

Candidate senses: advisor (financial / academic), gave (physical / figurative), feedback (electrical / criticism), draft (wind / paper)
Collective Disambiguation

Objects: words in the text
Attributes: sense, gender, number, part of speech, …
Links:
  Grammatical relations (subject-object, modifier, …)
  Close semantic relations (is-a, cause-of, …)
  Same word in different sentences (one-sense-per-discourse)
Compatibility parameters:
  Learned from tagged data
  Based on prior knowledge (e.g., WordNet, FrameNet)

"Her advisor gave her feedback about the draft."
Candidate senses: advisor (financial / academic), gave (physical / figurative), feedback (electrical / criticism), draft (wind / paper)

Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially?
Can we integrate inter-word relationships directly into our probabilistic model?
Relation Extraction

"ACME's board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] … In an attempt to improve the company's image, ACME is considering former judge Mary Miller for the job. [7/01] … As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01] …"

[Figure: extracted relations; Jackson Departs as CEO Of ACME, Miller is a Candidate, Miller Made an Announcement that Concerns ACME, and the open question Hired??]
Understanding Language

"Professor Sarah met Jane. She explained the hole in her proof."

[Cartoon: a "proof" of Theorem: P=NP by setting N=1]

Most likely interpretation: Student Jane, Professor Sarah

Resolving Ambiguity

Professors often meet with students → Jane is probably a student
Professors like to explain → "She" is probably Prof. Sarah

This requires reasoning about attribute values, link types, and object identity
[Goldman & Charniak; Pasula & Russell]

Probabilistic reasoning about objects, their attributes, and the relationships between them
Acquiring Semantic Models

Statistical NLP reveals patterns:
  Standard models learn patterns at the word level
  But word patterns are only implicit surrogates for underlying semantic patterns: "teacher" objects tend to participate in certain relationships
  Can use this pattern for objects not explicitly labeled as a teacher

[Figure: verbs whose argument is "teacher", with frequencies: be 24%, train 3%, hire 3%, pay 1.5%, fire 1.4%, serenade 0.3%]
Competing Approaches
[Figure: desiderata are semantic understanding, scaling up (via learning), and handling noise & ambiguity; logical and statistical approaches each cover only part of this space, and PRMs aim to cover it all]
Complementary Approaches
Statistics: from Words to Semantics

Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships
Learn statistical models of semantics from text
Reason using the models to obtain a global semantic understanding of the text

[Image: Georgia O'Keeffe, "Ladder to the Moon"]