Probabilistic Models of Relational Data
Daphne Koller, Stanford University
Joint work with:
Ben Taskar, Pieter Abbeel, Lise Getoor, Eran Segal, Nir Friedman, Avi Pfeffer, Ming-Fai Wong
Why Relational?
The real world is composed of objects that have properties and are related to each other
Natural language is all about objects and how they relate to each other: “George got an A in Geography 101”
Attribute-Based Worlds
Smart students get A’s in easy classes:
  Smart_Jane & easy_CS101 → GetA_Jane_CS101
  Smart_Mike & easy_Geo101 → GetA_Mike_Geo101
  Smart_Jane & easy_Geo101 → GetA_Jane_Geo101
  Smart_Rick & easy_CS221 → GetA_Rick_CS221
World = assignment of values to attributes / truth values to propositional symbols
Object-Relational Worlds
World = relational interpretation: objects in the domain, properties of these objects, relations (links) between objects
∀x,y (Smart(x) ∧ Easy(y) ∧ Take(x,y) → Grade(A,x,y))
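As a small illustration (students, courses, and grades below are invented, not from the talk's data), a relational world can be represented directly, and the universal above becomes a check over that interpretation:

```python
# A tiny relational interpretation: objects, their properties, and relations.
students = {"Jane": {"smart": True}, "Mike": {"smart": False}}
courses = {"Geo101": {"easy": True}, "CS221": {"easy": False}}
takes = {("Jane", "Geo101"), ("Mike", "Geo101"), ("Jane", "CS221")}
grade = {("Jane", "Geo101"): "A", ("Mike", "Geo101"): "B", ("Jane", "CS221"): "B"}

def universal_holds():
    """forall x,y: Smart(x) & Easy(y) & Take(x,y) -> Grade(A,x,y)"""
    return all(
        grade[(s, c)] == "A"
        for (s, c) in takes
        if students[s]["smart"] and courses[c]["easy"]
    )

print(universal_holds())  # True for this interpretation
```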
Why Probabilities?
All universals are false (almost)
  “Smart students get A’s in easy classes”
True universals are rarely useful
  “Smart students get either A, B, C, D, or F”

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities …
— James Clerk Maxwell
Probable Worlds
Probabilistic semantics: a set of possible worlds, each world associated with a probability
[Figure: the 12 possible worlds — each combination of course difficulty {easy, hard}, student intelligence {smart, weak}, and grade {A, B, C} — each assigned a probability]
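The possible-worlds semantics can be sketched by enumerating all 12 worlds and normalizing weights into a distribution (the weights below are made up for illustration):

```python
from itertools import product

# World = (course difficulty, student intelligence, grade).
difficulties = ["easy", "hard"]
intelligences = ["smart", "weak"]
grades = ["A", "B", "C"]

# Assign each world an unnormalized weight, then normalize.
weights = {}
for d, i, g in product(difficulties, intelligences, grades):
    w = 1.0
    if i == "smart" and d == "easy" and g == "A":
        w = 3.0  # smart student + easy course favors an A (illustrative)
    weights[(d, i, g)] = w
Z = sum(weights.values())
P = {world: w / Z for world, w in weights.items()}

# Any query is a sum over worlds, e.g. P(grade = A):
p_A = sum(p for (d, i, g), p in P.items() if g == "A")
```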
Representation: Design Axes
Two design axes: world state (attributes → objects, plus sequences) and epistemic state (categorical → probabilistic)

Categorical + attributes: propositional logic, CSPs
Categorical + objects: first-order logic, relational databases
Categorical + sequences: automata, grammars
Probabilistic + attributes: Bayesian nets, Markov nets
Probabilistic + sequences: n-gram models, HMMs, prob. CFGs
Outline
  Bayesian Networks: representation & semantics; reasoning
  Probabilistic Relational Models
  Collective Classification
  Undirected discriminative models
  Collective Classification Revisited
  PRMs for NLP
Bayesian Networks
nodes = variables; edges = direct influence
Graph structure encodes independence assumptions: Letter conditionally independent of Intelligence given Grade
[Figure: BN over Difficulty, Intelligence, Grade, Letter, SAT; CPD P(G|D,I) shown as bars over grades A/B/C for each (difficulty, intelligence) pair]
BN semantics
Compact & natural representation:
  nodes have ≤ k parents ⇒ O(2^k · n) vs. 2^n parameters
  parameters natural and easy to elicit

conditional independencies in BN structure + local probability models = full joint distribution over domain

P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)
Full joint distribution specifies answer to any query: P(variable | evidence about others)
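The factorization and query semantics can be sketched by brute-force enumeration; the CPD numbers below are made up for illustration:

```python
from itertools import product

# Illustrative CPDs for the student BN (all numbers invented).
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {  # P(grade | difficulty, intelligence)
    ("easy", "high"): {"A": 0.9, "B": 0.08, "C": 0.02},
    ("easy", "low"):  {"A": 0.3, "B": 0.4,  "C": 0.3},
    ("hard", "high"): {"A": 0.5, "B": 0.3,  "C": 0.2},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
}
P_L = {"A": {"yes": 0.9, "no": 0.1}, "B": {"yes": 0.6, "no": 0.4},
       "C": {"yes": 0.01, "no": 0.99}}          # P(letter | grade)
P_S = {"high": {"high": 0.8, "low": 0.2},
       "low":  {"high": 0.05, "low": 0.95}}     # P(SAT | intelligence)

def joint(d, i, g, l, s):
    # BN chain rule: P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_L[g][l] * P_S[i][s]

# Any query is a ratio of sums over the joint, e.g. P(I = high | Letter = yes):
num = den = 0.0
for d, i, g, s in product(P_D, P_I, "ABC", ["high", "low"]):
    p = joint(d, i, g, "yes", s)
    den += p
    if i == "high":
        num += p
posterior = num / den
```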
Reasoning using BNs
[Figure: evidence on Letter and SAT propagates through the student BN to update beliefs about Intelligence and Difficulty]

Probability theory is nothing but common sense reduced to calculation.
— Pierre Simon Laplace
BN Inference
BN inference is NP-hard
Can exploit graph structure:
  Graph separation ⇒ conditional independence
  Do separate inference in parts; results combined over the interface
  Complexity: exponential in the largest separator
Structured BNs allow effective inference; exact inference in dense BNs is intractable
Approximate BN Inference
Belief propagation is an iterative message-passing algorithm for approximate inference in BNs
Each iteration (until “convergence”): nodes pass “beliefs” as messages to neighboring nodes
Pros:
  Linear time per iteration
  Works very well in practice, even for dense networks
Cons:
  Limited theoretical guarantees
  Might not converge
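A minimal sum-product sketch (potentials invented): on a tree-structured network, belief propagation is exact, so the message-passing marginal can be checked against brute-force enumeration:

```python
import itertools

# Sum-product message passing on a 3-node chain X1 - X2 - X3.
vals = [0, 1]
psi12 = {(a, b): 2.0 if a == b else 1.0 for a in vals for b in vals}
psi23 = {(b, c): 3.0 if b == c else 1.0 for b in vals for c in vals}

# Messages into X2 from each neighbor:
m1_to_2 = {b: sum(psi12[(a, b)] for a in vals) for b in vals}
m3_to_2 = {b: sum(psi23[(b, c)] for c in vals) for b in vals}
belief2 = {b: m1_to_2[b] * m3_to_2[b] for b in vals}
Z = sum(belief2.values())
marg_bp = {b: belief2[b] / Z for b in vals}

# Brute-force check against the full joint:
joint = {(a, b, c): psi12[(a, b)] * psi23[(b, c)]
         for a, b, c in itertools.product(vals, repeat=3)}
Zj = sum(joint.values())
marg_exact = {b: sum(p for (a, bb, c), p in joint.items() if bb == b) / Zj
              for b in vals}
assert all(abs(marg_bp[b] - marg_exact[b]) < 1e-12 for b in vals)
```

On loopy graphs the same messages are iterated until (approximate) convergence, which is the case the "cons" above refer to.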
Outline
  Bayesian Networks
  Probabilistic Relational Models: language & semantics; web of influence
  Collective Classification
  Undirected discriminative models
  Collective Classification Revisited
  PRMs for NLP
Bayesian Networks: Problem
Bayesian nets use a propositional representation; the real world has objects, related to each other
[Figure: the (Intelligence, Difficulty) → Grade template replicated as Intell_Jane/Diffic_CS101 → Grade_Jane_CS101, Intell_George/Diffic_Geo101 → Grade_George_Geo101, Intell_George/Diffic_CS101 → Grade_George_CS101, with observed grades A and C]
These “instances” are not independent
Probabilistic Relational Models
Combine advantages of relational logic & BNs:
  Natural domain modeling: objects, properties, relations
  Generalization over a variety of situations
  Compact, natural probability models
Integrate uncertainty with relational model:
  Properties of domain entities can depend on properties of related entities
  Uncertainty over relational structure of domain
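A minimal PRM-style sketch (not the papers' actual API; all names and numbers invented): one shared CPD template is reused for every registration object in a relational skeleton:

```python
# Relational skeleton: objects and links.
professors = {"Smith": "high", "Jones": "low"}          # teaching ability
courses = {"CS101": ("Smith", "easy"), "Geo101": ("Jones", "hard")}
students = {"Jane": "high", "George": "low"}            # intelligence
registrations = [("Jane", "CS101"), ("George", "CS101"), ("George", "Geo101")]

# One shared CPD template P(Grade | Course.Difficulty, Student.Intelligence),
# reused for every registration (numbers are made up).
grade_cpd = {
    ("easy", "high"): {"A": 0.8, "B": 0.15, "C": 0.05},
    ("easy", "low"):  {"A": 0.3, "B": 0.4,  "C": 0.3},
    ("hard", "high"): {"A": 0.5, "B": 0.35, "C": 0.15},
    ("hard", "low"):  {"A": 0.1, "B": 0.3,  "C": 0.6},
}

def grade_dist(student, course):
    """Instantiate the template for one Registration object."""
    _, difficulty = courses[course]
    return grade_cpd[(difficulty, students[student])]

# Unrolling the template over the skeleton yields one ground variable
# per registration, all sharing the same parameters:
ground_bn = {(s, c): grade_dist(s, c) for (s, c) in registrations}
```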
St. Nordaf University
[Figure: St. Nordaf University — Prof. Smith and Prof. Jones (Teaching-ability) teach CS101 and Geo101 (Difficulty); George and Jane (Intelligence) are registered in the courses, each registration carrying a Grade and a Satisfaction]
Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects
[Figure: schema — classes Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Registration (Grade, Satisfaction); relations Teach, In, Take]
Probabilistic Relational Models
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions
[Figure: PRM dependency graph — Professor.Teaching-Ability, Course.Difficulty, and Student.Intelligence feed Reg.Grade and Reg.Satisfaction; CPD P(Grade | Difficulty, Intelligence) shown as bars over A/B/C]
[K. & Pfeffer; Poole; Ngo & Haddawy]
PRM Semantics
[Figure: the PRM instantiated on St. Nordaf — ground variables for Prof. Smith's and Prof. Jones's teaching abilities, CS101's and Geo101's difficulties, George's and Jane's intelligences, and a grade and satisfaction per registration]
Instantiated PRM ⇒ BN: variables = attributes of all objects; dependencies determined by links & PRM
The Web of Influence
[Figure: posteriors over George's and Jane's intelligence (low/high) and over CS101's and Geo101's difficulty (easy/hard) shift as grade evidence (A, C) arrives]
Outline
  Bayesian Networks
  Probabilistic Relational Models
  Collective Classification & Clustering: learning models from data; collective classification of webpages
  Undirected discriminative models
  Collective Classification Revisited
  PRMs for NLP
Learning PRMs
[Figure: a Learner takes a relational database (Course, Student, Reg tables) plus expert knowledge and outputs a PRM]
[Friedman, Getoor, K., Pfeffer]
Learning PRMs
Parameter estimation:
  Probabilistic model with shared parameters: grades for all students share the same model
  Can use standard techniques for max-likelihood or Bayesian parameter estimation
Structure learning:
  Define scoring function over structures
  Use combinatorial search to find a high-scoring structure
P̂(Reg.Grade = A | Student.Intell = hi, Course.Diff = lo) =
  #(Reg.Grade = A, Student.Intell = hi, Course.Diff = lo) / #(Reg.Grade = *, Student.Intell = hi, Course.Diff = lo)
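The shared-parameter maximum-likelihood estimate is just a ratio of counts over all registrations in the database; the toy records below are invented:

```python
# Each record: (student intelligence, course difficulty, grade).
records = [
    ("hi", "lo", "A"), ("hi", "lo", "A"), ("hi", "lo", "B"),
    ("hi", "hi", "B"), ("lo", "lo", "B"), ("lo", "hi", "C"),
]

def mle(grade, intell, diff):
    """Counting estimator for P(Reg.Grade | Student.Intell, Course.Diff)."""
    num = sum(1 for i, d, g in records if (i, d, g) == (intell, diff, grade))
    den = sum(1 for i, d, _ in records if (i, d) == (intell, diff))
    return num / den

# With these toy records: 2 of the 3 (hi, lo) registrations got an A.
p = mle("A", "hi", "lo")  # 2/3
```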
Web KB
[Figure: Tom Mitchell (Professor) — Advisor-of → Sean Slattery (Student); Project-of / Member links to the WebKB Project]
[Craven et al.]
Web Classification Experiments
WebKB dataset: four CS department websites; bag of words on each page; links between pages; anchor text for links
Experimental setup: trained on three universities, tested on the fourth; repeated for all four combinations
[Figure: example professor page, with words like “department”, “extract information”, “computer science”, “machine learning” …]
Standard Classification
Categories: faculty, course, project, student, other
Naïve Bayes model: Page Category → Word1 … WordN
[Chart: classification error of the words-only model]
Exploiting Links
Add anchor text of incoming links (“working with Tom Mitchell …”) as extra evidence: Category → Word1 … WordN, LinkWordN
[Chart: error of “words only” vs. “link words”]
Collective Classification
[Figure: categories of linked pages interact — From-Page.Category and To-Page.Category connected through LinkExists]
Approx. inference: belief propagation [Getoor, Segal, Taskar, Koller]
[Chart: error of “words only” vs. “link words” vs. “collective”]
Classify all pages collectively,
maximizing the joint label probability
P(Registration.Grade | Course.Difficulty, Student.Intelligence)
Learning w. Missing Data: EM
[Figure: EM alternates between inferring hidden values — students' intelligence (low/high) and courses' difficulty (easy/hard) — and re-estimating the CPD P(Grade | Difficulty, Intelligence) over grades A/B/C]
[Dempster et al. 77]
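An EM sketch for one hidden attribute, under a simplified model P(grade | intelligence) with difficulty omitted; the data and starting parameters are invented:

```python
# Observed grades per student; intelligence is hidden.
data = {"Jane": ["A", "A", "B"], "George": ["C", "B", "C"], "Jill": ["A", "B", "A"]}
grades = ["A", "B", "C"]

# Initial parameters (asymmetric so EM can break symmetry):
p_hi = 0.5
p_g = {"hi": {"A": 0.5, "B": 0.3, "C": 0.2},
       "lo": {"A": 0.2, "B": 0.3, "C": 0.5}}

for _ in range(50):
    # E-step: posterior over each student's hidden intelligence.
    post = {}
    for stu, gs in data.items():
        w_hi, w_lo = p_hi, 1 - p_hi
        for g in gs:
            w_hi *= p_g["hi"][g]
            w_lo *= p_g["lo"][g]
        post[stu] = w_hi / (w_hi + w_lo)
    # M-step: re-estimate parameters from expected counts.
    p_hi = sum(post.values()) / len(data)
    for t, weight in (("hi", lambda s: post[s]), ("lo", lambda s: 1 - post[s])):
        counts = {g: 1e-9 for g in grades}   # tiny smoothing
        for stu, gs in data.items():
            for g in gs:
                counts[g] += weight(stu)
        total = sum(counts.values())
        p_g[t] = {g: c / total for g, c in counts.items()}

# With this data, A-heavy students (Jane, Jill) end up "hi", George "lo".
```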
Discovering Hidden Types
Internet Movie Database (http://www.imdb.com)
[Figure: schema — Actor, Director, Movie (Genres, Rating, MPAA Rating, Year, #Votes)]
[Figure: a hidden Type attribute added to Actor, Director, and Movie]
[Taskar, Segal, Koller]
Directors:
  Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
  Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola
Actors:
  Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger …
Movies:
  Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson …
  Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October
Outline
  Bayesian Networks
  Probabilistic Relational Models
  Collective Classification & Clustering
  Undirected Discriminative Models: Markov networks; relational Markov networks
  Collective Classification Revisited
  PRMs for NLP
Directed Models: Limitations
  Acyclicity constraint limits expressive power: e.g., two objects linked to by a student are probably not both professors
  Acyclicity forces modeling of all potential links: network size O(N²), inference is quadratic
  Generative training: trained to fit all of the data, not to maximize accuracy
Solution: Undirected Models
  Allow arbitrary patterns over sets of objects & links
  Influence flows over existing links, exploiting link graph sparsity: network size O(N)
  Allow discriminative training: max P(labels | observations)
[Lafferty, McCallum, Pereira]
Markov Networks
Graph structure encodes independence assumptions: Chris conditionally independent of Eve given Alice & Dave
[Figure: Markov network over Alice, Betty, Chris, Dave, Eve]
P(A,B,C,D,E) = (1/Z) Φ(A,B,C) Φ(C,D) Φ(D,E) Φ(E,A)
[Table: compatibility potential Φ(A,B,C) over truth assignments TTT … FFF]
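The normalized-product-of-potentials semantics can be sketched directly (potential values are invented):

```python
from itertools import product

# Variables Alice, Betty, Chris, Dave, Eve (binary); illustrative potentials.
def phi_abc(a, b, c): return 2.0 if a == b == c else 1.0
def phi_cd(c, d):     return 3.0 if c == d else 1.0
def phi_de(d, e):     return 1.5 if d == e else 1.0
def phi_ea(e, a):     return 2.0 if e != a else 1.0

def unnorm(a, b, c, d, e):
    # Product of clique potentials, before normalization.
    return phi_abc(a, b, c) * phi_cd(c, d) * phi_de(d, e) * phi_ea(e, a)

# Partition function Z sums the product over all 2^5 assignments.
Z = sum(unnorm(*w) for w in product([0, 1], repeat=5))

def P(a, b, c, d, e):
    return unnorm(a, b, c, d, e) / Z
```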
Relational Markov Networks
Universals: probabilistic patterns hold for all groups of objects
Locality: represent local probabilistic dependencies; sets of links give us possible interactions
[Figure: RMN template — Reg.Grade and Reg2.Grade of two students in the same Study Group are tied by a template potential over grade pairs (AA … CC)]
[Taskar, Abbeel, Koller ’02]
RMN Semantics
Instantiated RMN ⇒ MN: variables = attributes of all objects; dependencies determined by links & RMN
[Figure: unrolled RMN — George's, Jane's, and Jill's intelligences and grades in CS101 and Geo101, with the CS and Geo study groups linking the grades of group members]
Outline
  Bayesian Networks
  Probabilistic Relational Models
  Collective Classification & Clustering
  Undirected Discriminative Models
  Collective Classification Revisited: discriminative training of RMNs; webpage classification; link prediction
  PRMs for NLP
Learning RMNs
Parameter estimation is not closed form
Convex problem ⇒ unique global maximum
Maximize L = log P(Grades, Intelligence | Difficulty)
  ∂L/∂λ_AA = #(grade pairs (A,A) observed) − E[#(grade pairs (A,A))]
[Figure: one template potential Φ(Reg1.Grade, Reg2.Grade), shared by every linked pair of registrations in the unrolled network]
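A discriminative-training sketch for a tiny log-linear model: the gradient of the conditional log-likelihood for each feature is (observed count − expected count under the model). The single "both grades equal" feature and the data are invented:

```python
import math
from itertools import product

grades = ["A", "B", "C"]
observations = [("A", "A"), ("A", "B")]   # toy linked grade pairs

def feature(g1, g2):
    return 1.0 if g1 == g2 else 0.0

obs_mean = sum(feature(*o) for o in observations) / len(observations)

lam = 0.0                                 # template-potential weight
for _ in range(200):                      # plain gradient ascent
    weights = {x: math.exp(lam * feature(*x))
               for x in product(grades, repeat=2)}
    Z = sum(weights.values())
    expected = sum(w / Z * feature(*x) for x, w in weights.items())
    lam += 0.5 * (obs_mean - expected)    # concave objective: unique optimum

# lam converges to log 2, where E[feature] matches the observed mean 0.5.
```

The concavity of the objective is what makes the "unique global maximum" claim above hold: any gradient method reaches the same solution.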
Flat Models
[Figure: flat model — Page Category → Word1 … WordN, LinkWordN]
Logistic regression: P(Category | Words)
[Chart: error of Naïve Bayes vs. Logistic vs. SVM]
Exploiting Links
[Figure: From-Page and To-Page categories connected through the Link]
[Chart: error of PRM vs. Logistic vs. RMN-link]
42.1% relative reduction in error relative to generative approach
More Complex Structure
[Figure: richer relational patterns — pages of faculty, students, and courses tied together through section and link structure]
Collective Classification: Results
[Chart: error of Logistic vs. Links vs. Section vs. Link+Section]
35.4% relative reduction in error relative to strong flat approach
Scalability
WebKB data set size: 1300 entities, 180K attributes, 5800 links
Network size / school:
  Directed model: 200,000 variables, 360,000 edges
  Undirected model: 40,000 variables, 44,000 edges
Training time: directed models 3 sec; undirected models 20 minutes
Classification time: directed models 180 sec; undirected models 15–20 sec
Difference in training time decreases substantially when some training data is unobserved (want to model with hidden variables)
Predicting Relationships
Even more interesting are the relationships between objects; e.g., verbs are almost always relationships
[Figure: Tom Mitchell (Professor), WebKB Project, Sean Slattery (Student), with relationship links Advisor-of and Member to be predicted]
Flat Model
[Figure: predict link Type (NONE, advisor, instructor, TA, member, project-of) from From-Page words, To-Page words, and link words, with pages treated independently]
Collective Classification: Links
Link Model
[Figure: link Type predicted jointly with From-Page.Category and To-Page.Category, plus link words]
Triad Model
Triad Model
[Figure: triad templates — Professor Advisor-of Student with both Members of the same Group; Professor Instructor and Student TA of the same Course]
WebKB++
Four new department web sites: Berkeley, CMU, MIT, Stanford
Labeled page type (8 types): faculty, student, research scientist, staff, research group, research project, course, organization
Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE
Data set size: 11K pages, 110K links, 2 million words
Link Prediction: Results
Error measured over links predicted to be present
Link presence cutoff is at the precision/recall break-even point (30% for all models)
[Chart: error (%) of Flat vs. Labels vs. Triad]
72.9% relative reduction in error relative to strong flat approach
Summary
PRMs inherit key advantages of probabilistic graphical models: coherent probabilistic semantics; exploiting the structure of local interactions
Relational models are inherently more expressive
“Web of influence”: use all available information to reach powerful conclusions
Exploit both relational information and the power of probabilistic reasoning
Outline
  Bayesian Networks
  Probabilistic Relational Models
  Collective Classification & Clustering
  Undirected Discriminative Models
  Collective Classification Revisited
  PRMs for NLP, or “Why Should I Care?”*
    Word-Sense Disambiguation; Relation Extraction; Natural Language Understanding (?)
* An outsider’s perspective
Her advisor gave her feedback about the draft.
Word Sense Disambiguation
Neighboring words alone may not provide enough information to disambiguate
We can gain insight by considering compatibility between senses of related words
[Figure: candidate senses for words in the sentence — financial/academic, physical/figurative, electrical/criticism, wind/paper]
Collective Disambiguation
Objects: words in text
Attributes: sense, gender, number, POS, …
Links: grammatical relations (subject-object, modifier, …); close semantic relations (is-a, cause-of, …); same word in different sentences (one-sense-per-discourse)
Compatibility parameters: learned from tagged data; based on prior knowledge (e.g., WordNet, FrameNet)
Her advisor gave her feedback about the draft.
Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially?
Can we integrate inter-word relationships directly into our probabilistic model?
Relation Extraction
[Figure: entity-relation graph extracted from the story — Jackson Departs as CEO, Miller is a Candidate, Miller Made an Announcement]
ACME’s board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] … In an attempt to improve the company’s image, ACME is considering former judge Mary Miller for the job. [7/01] … As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01] …
Hired??
Professor Sarah met Jane. She explained the hole in her proof.
Understanding Language
[Figure: the proof in question — “Theorem: P=NP. Proof: N=1.”]
Most likely interpretation: Jane is a student; Sarah is a professor
Resolving Ambiguity
  Professors often meet with students ⇒ Jane is probably a student
  Professors like to explain ⇒ “She” is probably Prof. Sarah
  Uncertainty over attribute values, link types, object identity
[Goldman & Charniak, Pasula & Russell]
Professor Sarah met Jane. She explained the hole in her proof.
Probabilistic reasoning about objects, their attributes, and the relationships between them
Acquiring Semantic Models
Statistical NLP reveals patterns:
  Standard models learn patterns at the word level
  But word patterns are only implicit surrogates for underlying semantic patterns
  “Teacher” objects tend to participate in certain relationships
  Can use this pattern for objects not explicitly labeled as a teacher
[Figure: relations “teacher” objects participate in — be 24%, train 3%, hire 3%, pay 1.5%, fire 1.4%, serenade 0.3%]
Competing Approaches
[Figure: desiderata — semantic understanding, scaling up (via learning), robustness to noise & ambiguity; logical vs. statistical approaches, with PRMs combining strengths of both]
Complementary Approaches
Statistics: from Words to Semantics
Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships
Learn statistical models of semantics from text
Reason using the models to obtain global semantic understanding of the text
[Image: Georgia O’Keeffe, Ladder to the Moon]