Probabilistic Models of Relational Data
Daphne Koller, Stanford University
Joint work with:
Lise Getoor
Ming-Fai Wong
Eran Segal
Avi Pfeffer
Pieter Abbeel
Nir Friedman
Ben Taskar
Why Relational?
The real world is composed of objects that have properties and are related to each other
Natural language is all about objects and how they relate to each other: "George got an A in Geography 101"
Attribute-Based Worlds
Smart students get A’s in easy classes
Smart_Jane & easy_CS101 → GetA_Jane_CS101
Smart_Mike & easy_Geo101 → GetA_Mike_Geo101
Smart_Jane & easy_Geo101 → GetA_Jane_Geo101
Smart_Rick & easy_CS221 → GetA_Rick_CS221
World = assignment of values to attributes / truth values to propositional symbols
Object-Relational Worlds
World = relational interpretation:
  Objects in the domain
  Properties of these objects
  Relations (links) between objects
∀x,y (Smart(x) & Easy(y) & Take(x,y) → Grade(A,x,y))
Why Probabilities?
All universals are false (almost):
  "Smart students get A's in easy classes"
True universals are rarely useful:
  "Smart students get either A, B, C, D, or F"

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities …"
James Clerk Maxwell
Probable Worlds

Probabilistic semantics:
  A set of possible worlds
  Each world associated with a probability
[Figure: the twelve possible worlds, one for each combination of course difficulty (easy / hard), student intelligence (smart / weak), and grade (A / B / C), each associated with a probability]
Representation: Design Axes
Two axes: the world state (attributes, sequences, objects) and the epistemic state (categorical, probabilistic):

                 Attributes                   Sequences                          Objects
Categorical      Propositional logic, CSPs    Automata, Grammars                 First-order logic, Relational databases
Probabilistic    Bayesian nets, Markov nets   n-gram models, HMMs, Prob. CFGs    (the gap this talk fills: PRMs)
Outline
Bayesian Networks
  Representation & Semantics
  Reasoning
Probabilistic Relational Models
Collective Classification
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Bayesian Networks
nodes = variables; edges = direct influence
Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade
[Figure: BN with nodes Difficulty, Intelligence, Grade, Letter, SAT; bar chart of the CPD P(G|D,I), giving the distribution over grades A/B/C for each of (hard,high), (hard,low), (easy,high), (easy,low)]
BN semantics
Compact & natural representation:
  nodes with at most k parents: 2^k·n vs. 2^n parameters
  parameters natural and easy to elicit

conditional independencies in BN structure + local probability models = full joint distribution over domain

P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)

Full joint distribution specifies the answer to any query: P(variable | evidence about others)
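The factored joint of the student network can answer any such query by summing out the other variables. A minimal Python sketch, with illustrative (made-up) CPD values, not numbers from the talk:

```python
import itertools

# Student BN: Difficulty (D), Intelligence (I), Grade (G), Letter (L), SAT (S).
# All CPD numbers below are invented for illustration.
P_D = {0: 0.6, 1: 0.4}                      # 0 = easy, 1 = hard
P_I = {0: 0.7, 1: 0.3}                      # 0 = low,  1 = high
P_G = {                                     # P(Grade | D, I); grades: 0=A, 1=B, 2=C
    (0, 0): [0.3, 0.4, 0.3],
    (0, 1): [0.9, 0.08, 0.02],
    (1, 0): [0.05, 0.25, 0.7],
    (1, 1): [0.5, 0.3, 0.2],
}
P_L = {0: 0.9, 1: 0.6, 2: 0.01}             # P(Letter = yes | Grade)
P_S = {0: 0.05, 1: 0.8}                     # P(SAT = high | Intelligence)

def joint(d, i, g, l, s):
    """P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)."""
    pl = P_L[g] if l else 1 - P_L[g]
    ps = P_S[i] if s else 1 - P_S[i]
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * pl * ps

def query(var_index, evidence):
    """P(variable | evidence) by summing the full joint over all worlds."""
    dist = {}
    for world in itertools.product([0, 1], [0, 1], [0, 1, 2], [0, 1], [0, 1]):
        if all(world[k] == v for k, v in evidence.items()):
            dist[world[var_index]] = dist.get(world[var_index], 0.0) + joint(*world)
    z = sum(dist.values())
    return {val: p / z for val, p in dist.items()}

# P(Intelligence | Letter = yes): observing a letter raises belief in high intelligence
print(query(1, {3: 1}))
```

Enumeration is exponential in the number of variables; it is shown here only to make the semantics concrete.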
Reasoning using BNs
[Figure: BN over Difficulty, Intelligence, Grade, Letter, SAT, with evidence entered at Letter and SAT]

"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon Laplace
BN Inference

BN inference is NP-hard in general
Inference can exploit graph structure:
  Graph separation → conditional independence
  Do separate inference in parts
  Combine results over the interface
Complexity: exponential in the largest separator
Structured BNs allow effective inference; exact inference in dense BNs is intractable
Approximate BN Inference

Belief propagation is an iterative message-passing algorithm for approximate inference in BNs
Each iteration (until "convergence"): nodes pass "beliefs" as messages to neighboring nodes
Pros: linear time per iteration; works very well in practice, even for dense networks
Cons: limited theoretical guarantees; might not converge
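The message-passing loop can be sketched as sum-product loopy BP on a tiny pairwise Markov network (a 3-node cycle of binary variables). All potentials below are made up for illustration:

```python
# Sum-product loopy belief propagation on a 3-node cycle of binary variables.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]
phi_node = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [2.0, 1.0]}   # unary potentials
phi_edge = [[2.0, 1.0], [1.0, 2.0]]                        # favors agreement

# msg[(i, j)][x_j]: message from i to j, initialized uniform
msg = {(i, j): [1.0, 1.0] for i, j in edges + [(j, i) for i, j in edges]}

def neighbors(i):
    return [j for a, b in edges for j in ((b,) if a == i else (a,) if b == i else ())]

for _ in range(50):                                        # iterate until (near) convergence
    new = {}
    for (i, j) in msg:
        m = [0.0, 0.0]
        for xj in (0, 1):
            for xi in (0, 1):
                incoming = 1.0
                for k in neighbors(i):
                    if k != j:
                        incoming *= msg[(k, i)][xi]        # product of other incoming messages
                m[xj] += phi_node[i][xi] * phi_edge[xi][xj] * incoming
        z = sum(m)
        new[(i, j)] = [v / z for v in m]                   # normalize for numerical stability
    msg = new

def belief(i):
    """Approximate marginal: unary potential times all incoming messages."""
    b = [phi_node[i][x] for x in (0, 1)]
    for k in neighbors(i):
        b = [b[x] * msg[(k, i)][x] for x in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

print([belief(i) for i in nodes])
```

On this small cycle the beliefs land close to the exact marginals; on trees BP is exact, and on loopy graphs it is the approximation the slide describes.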
Outline
Bayesian Networks
Probabilistic Relational Models
  Language & Semantics
  Web of Influence
Collective Classification
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Bayesian Networks: Problem
Bayesian nets use a propositional representation, but the real world has objects related to each other

[Figure: the template fragment (Intelligence, Difficulty → Grade) instantiated for (Jane, CS101), (George, Geo101), and (George, CS101)]

These "instances" are not independent
Probabilistic Relational Models
Combine advantages of relational logic & BNs:
  Natural domain modeling: objects, properties, relations
  Generalization over a variety of situations
  Compact, natural probability models
Integrate uncertainty with relational model:
  Properties of domain entities can depend on properties of related entities
  Uncertainty over relational structure of domain
St. Nordaf University
[Figure: an example world; Prof. Smith and Prof. Jones (Teaching-ability) teach Welcome to CS101 and Welcome to Geo101 (Difficulty); George and Jane (Intelligence) are registered in these courses, each registration carrying a Grade and Satisfaction]
Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects

Classes & attributes:
  Professor: Teaching-Ability
  Course: Difficulty
  Student: Intelligence
  Registration: Grade, Satisfaction
Relations:
  Teach (Professor, Course)
  In (Registration, Course)
  Take (Student, Registration)
Probabilistic Relational Models
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions

[Figure: PRM dependency structure over Professor.Teaching-Ability, Course.Difficulty, Student.Intelligence, Reg.Grade, and Reg.Satisfaction]
[K. & Pfeffer; Poole; Ngo & Haddawy]
PRM Semantics

Instantiated PRM → BN:
  variables: attributes of all objects
  dependencies: determined by links & PRM

[Figure: the PRM, with its CPD P(G|D,I), unrolled over Prof. Smith and Prof. Jones (Teaching-ability), Welcome to CS101 and Welcome to Geo101 (Difficulty), and George and Jane (Intelligence: low / high), yielding one BN variable per attribute of each object]
The Web of Influence
[Figure: observed grades (A's and C's) in Welcome to Geo101 shift posterior beliefs about course difficulty (easy / hard) and student intelligence (low / high) throughout the network]
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
  Learning models from data
  Collective classification of webpages
Undirected discriminative models
Collective Classification Revisited
PRMs for NLP
Learning PRMs
[Figure: a Learner takes as input a relational database (Course, Student, Reg tables) and expert knowledge, and outputs a PRM]
[Friedman, Getoor, K., Pfeffer]
Learning PRMs

Parameter estimation:
  Probabilistic model with shared parameters: grades for all students share the same model
  Can use standard techniques for maximum-likelihood or Bayesian parameter estimation
Structure learning:
  Define a scoring function over structures
  Use combinatorial search to find a high-scoring structure
P̂(Reg.Grade = A | Student.Intell = hi, Course.Diff = lo) =
    #(Reg.Grade = A, Student.Intell = hi, Course.Diff = lo)
    / #(Reg.Grade = *, Student.Intell = hi, Course.Diff = lo)
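The shared-parameter maximum-likelihood estimate is just a ratio of counts pooled across all registrations. A toy Python sketch; the tiny relational "database" is made up for illustration:

```python
from collections import Counter

# Toy relational data (invented): one CPD P(Grade | Intelligence, Difficulty)
# is shared by every Registration object.
students = {"Jane": "hi", "George": "lo", "Alice": "hi"}   # Student.Intelligence
courses  = {"CS101": "lo", "Geo101": "hi"}                 # Course.Difficulty
regs = [("Jane",   "CS101",  "A"), ("Jane",   "Geo101", "B"),
        ("George", "CS101",  "B"), ("George", "Geo101", "C"),
        ("Alice",  "CS101",  "B"), ("Alice",  "Geo101", "A")]

# Pool sufficient statistics across all registrations
counts = Counter()
for s, c, g in regs:
    counts[(students[s], courses[c], g)] += 1

def mle(grade, intell, diff):
    """P^(Reg.Grade = grade | Student.Intelligence = intell, Course.Difficulty = diff)."""
    den = sum(counts[(intell, diff, g)] for g in "ABC")
    return counts[(intell, diff, grade)] / den if den else 0.0

print(mle("A", "hi", "lo"))   # fraction of (hi, lo) registrations that got an A
```

Because every registration shares the same CPD, even a handful of students and courses can contribute many observations to each parameter.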
Web KB
[Figure: fragment of the WebKB domain; Tom Mitchell (Professor), WebKB Project, Sean Slattery (Student), connected by Advisor-of, Project-of, and Member links]
[Craven et al.]
Web Classification Experiments
WebKB dataset:
  Four CS department websites
  Bag of words on each page
  Links between pages
  Anchor text for links
Experimental setup:
  Trained on three universities, tested on the fourth
  Repeated for all four combinations

[Example page fragment: "Professor … department … extract information … computer science … machine learning …"]
Standard Classification
Categories: faculty, course, project, student, other

Naïve Bayes model: Page.Category is the parent of Word1 … WordN

[Chart: classification error, words-only baseline]
Exploiting Links
Incoming anchor text (e.g., "working with Tom Mitchell …") supplies LinkWords in addition to page words

Model: Page.Category is the parent of Word1 … WordN and of the LinkWords

[Chart: classification error, words only vs. link words]
Collective Classification
[Figure: collective model; for each candidate link, Link.Exists depends on the Categories of the From-Page and To-Page, each page with its Words]
[Getoor, Segal, Taskar, Koller]
Approximate inference: belief propagation

[Chart: classification error, words only vs. link words vs. collective]

Classify all pages collectively, maximizing the joint label probability
P(Registration.Grade | Course.Difficulty, Student.Intelligence)
[Chart: the CPD P(Grade | Difficulty, Intelligence) over (hard,high), (hard,low), (easy,high), (easy,low)]
Learning w. Missing Data: EM
[Figure: EM iteration; alternately complete the unobserved values (students' intelligence low / high, courses' difficulty easy / hard) given the observed grades A/B/C, and re-estimate the shared CPD P(Grade | Difficulty, Intelligence)]
[Dempster et al. 77]
Discovering Hidden Types
Internet Movie Database (http://www.imdb.com)

[Figure: schema with classes Actor, Director, and Movie, where Movie has attributes Genres, Rating, Year, #Votes, MPAA Rating]
Discovering Hidden Types
[Figure: a hidden Type attribute is added to each class (Actor, Director, Movie)]
[Taskar, Segal, Koller]
Directors (two discovered types):
  Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
  Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors (two discovered types):
  Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger
  …
Movies (discovered types include):
  Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
  …
  Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
  Markov Networks
  Relational Markov Networks
Collective Classification Revisited
PRMs for NLP
Directed Models: Limitations

Acyclicity constraint limits expressive power: e.g., two objects linked to by a student are probably not both professors
Acyclicity forces modeling of all potential links: network size O(N²), so inference is quadratic
Generative training: trained to fit all of the data, not to maximize accuracy

Solution: Undirected Models

Allow arbitrary patterns over sets of objects & links
Influence flows over existing links, exploiting link-graph sparsity: network size O(N)
Allow discriminative training: maximize P(labels | observations)
[Lafferty, McCallum, Pereira]
Markov Networks
Graph structure encodes independence assumptions: Chris conditionally independent of Eve given Alice & Dave

[Figure: Markov network over Alice, Betty, Chris, Dave, Eve]

P(A,B,C,D,E) = (1/Z) φ(A,B,C) φ(C,D) φ(D,E) φ(E,A)

[Chart: compatibility potential φ(A,B,C) over the eight truth assignments TTT … FFF, values ranging from 0 to 2]
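Markov-network semantics can be sketched directly: multiply the clique potentials and normalize by the partition function Z. The potential values below are illustrative, not from the talk:

```python
import itertools

# P(a,b,c,d,e) = (1/Z) * phi1(a,b,c) * phi2(c,d) * phi3(d,e) * phi4(e,a)
# Potentials reward agreement among the five binary variables (made-up numbers).
def phi1(a, b, c): return 2.0 if a == b == c else 1.0
def phi2(c, d):    return 3.0 if c == d else 1.0
def phi3(d, e):    return 3.0 if d == e else 1.0
def phi4(e, a):    return 3.0 if e == a else 1.0

def unnorm(a, b, c, d, e):
    """Unnormalized measure: product of clique potentials."""
    return phi1(a, b, c) * phi2(c, d) * phi3(d, e) * phi4(e, a)

# Partition function: sum over all 32 joint assignments
Z = sum(unnorm(*w) for w in itertools.product((0, 1), repeat=5))

def P(*w):
    return unnorm(*w) / Z

# sanity check: the 32 world probabilities sum to 1
print(sum(P(*w) for w in itertools.product((0, 1), repeat=5)))
```

Note that Z couples all the variables, which is exactly why learning and inference in undirected models are harder than reading off CPDs in a BN.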
Relational Markov Networks
Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions

[Figure: RMN template; two students in the same Study Group, each with Intelligence and a Reg.Grade in a Course with Difficulty]
[Taskar, Abbeel, Koller ‘02]
[Chart: template potential over grade pairs AA, AB, AC, BA, BB, BC, CA, CB, CC]
RMN Semantics

Instantiated RMN → MN:
  variables: attributes of all objects
  dependencies: determined by links & RMN

[Figure: the template unrolled over George, Jane, and Jill (Intelligence) with Grades in Welcome to CS101 and Welcome to Geo101 (Difficulty), coupled through the CS and Geo study groups]
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited
  Discriminative training of RMNs
  Webpage classification
  Link prediction
PRMs for NLP
Learning RMNs

Parameter estimation is not closed form
Convex problem → unique global maximum

Maximize L = log P(Grades, Intelligence | Difficulty)

∂L/∂λ(Grade=A, Grade=A) = #(Grade=A, Grade=A) − E[#(Grade=A, Grade=A)]
(empirical feature count minus expected feature count)

[Figure: gradient updates to the template potential over grade pairs (AA … CC), instantiated over the students' Intelligence (low / high) and Grades (A/B/C) given course Difficulty (easy / hard)]
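The same "empirical minus expected feature counts" gradient drives discriminative training in the simplest log-linear case, logistic regression. A self-contained sketch on synthetic data (features and labels invented for illustration):

```python
import math

# Gradient ascent on the conditional log-likelihood of a log-linear model.
# Each example: features [bias, x], label 1 iff x > 0 (synthetic, separable).
data = [([1.0, x], 1 if x > 0 else 0) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]

w = [0.0, 0.0]
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0]
    for feats, y in data:
        # model probability P(y=1 | feats) under current weights
        p = 1.0 / (1.0 + math.exp(-sum(wi * f for wi, f in zip(w, feats))))
        for k, f in enumerate(feats):
            grad[k] += (y - p) * f      # empirical minus expected feature count
    w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]

print(w)   # w[1] grows positive: larger x pushes toward label 1
```

Because the conditional log-likelihood is concave, this ascent has a unique global maximum, the property the slide claims for RMN training (where the expected counts are computed by inference rather than a single sigmoid).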
Flat Models

Logistic regression: P(Category | Words), over page words and link words

[Chart: classification error, Naïve Bayes vs. Logistic vs. SVM]
Exploiting Links
[Figure: RMN model; each link couples the Categories of the From-Page and To-Page, each page with its Words]

[Chart: classification error, PRM vs. Logistic vs. RMN-link]

42.1% relative reduction in error compared to the generative approach
More Complex Structure
[Figure: richer relational patterns; a page's Category and Words together with related Faculty, Students, and Courses]
Collective Classification: Results
[Chart: classification error, Logistic vs. Links vs. Section vs. Link+Section]

35.4% relative reduction in error compared to the strong flat approach
Scalability

WebKB data set size: 1300 entities, 180K attributes, 5800 links

Network size per school:
  Directed model: 200,000 variables, 360,000 edges
  Undirected model: 40,000 variables, 44,000 edges

                      Training    Classification
  Directed models     3 sec       20 minutes
  Undirected models   180 sec     15-20 sec

The difference in training time decreases substantially when some training data is unobserved (want to model with hidden variables)
Predicting Relationships
Even more interesting are the relationships between objects: e.g., verbs are almost always relationships

[Figure: Tom Mitchell (Professor), WebKB Project, Sean Slattery (Student), with Advisor-of and Member links; the task is to predict the relation type (Rel) of each link]
Flat Model

[Figure: predict each link's Type (NONE, advisor, instructor, TA, member, project-of) from the From-Page words, To-Page words, and link words]
Collective Classification: Links
[Figure: link model; the link Type depends on the link words and on the Categories of the From-Page and To-Page]
Triad Model

[Figure: triad templates; e.g., a Professor who is Advisor of a Student, with both Members of the same Group, or the Professor as Instructor and the Student as TA of the same Course]
WebKB++

Four new department web sites: Berkeley, CMU, MIT, Stanford
Labeled page type (8 types): faculty, student, research scientist, staff, research group, research project, course, organization
Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE
Data set size: 11K pages, 110K links, 2 million words
Link Prediction: Results
Error measured over links predicted to be present
Link-presence cutoff is at the precision/recall break-even point (30% for all models)

[Chart: link prediction error, Flat vs. Labels vs. Triad]

72.9% relative reduction in error compared to the strong flat approach
Summary
PRMs inherit key advantages of probabilistic graphical models:
  Coherent probabilistic semantics
  Exploit structure of local interactions
Relational models are inherently more expressive
"Web of influence": use all available information to reach powerful conclusions
Exploit both relational information and the power of probabilistic reasoning
Outline
Bayesian Networks
Probabilistic Relational Models
Collective Classification & Clustering
Undirected Discriminative Models
Collective Classification Revisited
PRMs for NLP, or "Why Should I Care?"*
  Word-Sense Disambiguation
  Relation Extraction
  Natural Language Understanding (?)

* An outsider's perspective
Word Sense Disambiguation

"Her advisor gave her feedback about the draft."

Neighboring words alone may not provide enough information to disambiguate
We can gain insight by considering compatibility between senses of related words

Candidate senses: advisor (financial / academic), gave (physical / figurative), feedback (electrical / criticism), draft (wind / paper)
Collective Disambiguation

Objects: words in the text
Attributes: sense, gender, number, part of speech, …
Links:
  Grammatical relations (subject-object, modifier, …)
  Close semantic relations (is-a, cause-of, …)
  Same word in different sentences (one-sense-per-discourse)
Compatibility parameters:
  Learned from tagged data
  Based on prior knowledge (e.g., WordNet, FrameNet)

"Her advisor gave her feedback about the draft."
Candidate senses: advisor (financial / academic), gave (physical / figurative), feedback (electrical / criticism), draft (wind / paper)

Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially?
Can we integrate inter-word relationships directly into our probabilistic model?
Relation Extraction

"ACME's board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] … In an attempt to improve the company's image, ACME is considering former judge Mary Miller for the job. [7/01] … As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01] …"

[Figure: extracted relations; Jackson Departs as CEO Of ACME, Miller is a Candidate, Miller Made an Announcement that Concerns ACME, and the open question Hired??]
Understanding Language

"Professor Sarah met Jane. She explained the hole in her proof."

[Cartoon: a "proof" of Theorem: P=NP by setting N=1]

Most likely interpretation: Student Jane, Professor Sarah

Resolving Ambiguity

Professors often meet with students → Jane is probably a student
Professors like to explain → "She" is probably Prof. Sarah

This requires reasoning about attribute values, link types, and object identity
[Goldman & Charniak; Pasula & Russell]

Probabilistic reasoning about objects, their attributes, and the relationships between them
Acquiring Semantic Models

Statistical NLP reveals patterns:
  Standard models learn patterns at the word level
  But word patterns are only implicit surrogates for underlying semantic patterns: "teacher" objects tend to participate in certain relationships
  Can use this pattern for objects not explicitly labeled as a teacher

[Figure: verbs whose argument is "teacher", with frequencies: be 24%, train 3%, hire 3%, pay 1.5%, fire 1.4%, serenade 0.3%]
Competing Approaches
[Figure: desiderata are semantic understanding, scaling up (via learning), and handling noise & ambiguity; logical and statistical approaches each cover only part of this space, and PRMs aim to cover it all]
Complementary Approaches
Statistics: from Words to Semantics

Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships
Learn statistical models of semantics from text
Reason using the models to obtain a global semantic understanding of the text

[Image: Georgia O'Keeffe, "Ladder to the Moon"]