Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks
R. Greiner and Wei Zhou, University of Alberta
Situation
CHALLENGE: Need to learn k classifiers:
  Cancer, from medical symptoms
  Meningitis, from medical symptoms
  Hepatitis, from medical symptoms
  …
Option 1: Learn k different classifier systems {S_Cancer, S_Menin, …, S_k},
then use S_i to deal with the ith "query class"…
but then we need to re-learn the inter-relations among factors and symptoms
that are common to all k classifiers.
Common Interrelationships
[Figure: two separate belief-net structures, one for Cancer and one for Meningitis, sharing the same interrelationships among the non-class variables]
Use Common Structure!
CHALLENGE: Need to learn k classifiers: Cancer, Meningitis, Hepatitis, …, each from medical symptoms.
Option 2: Learn 1 "structure" S of relationships, then use S to address all k classification tasks.
Actual Approach: Learn 1 Bayesian Belief Net, inter-relating the information for all k types of queries.
Outline
  Motivation: handle multiple class variables
  Framework: formal model (Belief Nets, multi-classifier)
  Results:
    Theoretical analysis
    Algorithms (Likelihood vs Conditional Likelihood)
    Empirical comparison (1 structure vs k structures; LL vs LCL)
  Contributions
[Figure: Training Data of partially specified tuples feeds the MC-Learner, which outputs a MultiClassifier MC.

  Cancer Menin Gender Age Smoke Height Btest
  T      .     F      35  T     .      .
  F      .     M      25  .     6'     .
  .      T     F      .   .     .      t
  F      T     .      .   .     5'3"   t

Given query rows with the class variable marked "?" (e.g. Cancer=?, M, 18, T, f), MC returns the value Q = q, filling in the "?" entries (T for the first query row, F for the second).]
Multi-Classifier I/O
Given a "query": a "class variable" Q and "evidence" E=e,
  e.g. Cancer=?, given Gender=F, Age=35, Smoke=t,
return the value, e.g. Cancer = Yes.
[Figure: a query tuple with Cancer=?, Gender=F, Age=35, Smoke=T becomes an answered tuple with Cancer=Yes]
MultiClassifier
Like standard classifiers, it can deal with different evidence E and different evidence values e.
Unlike standard classifiers, it can also deal with different class variables Q:
  MC(Cancer; Gender=M, Age=25, Height=6') = No
  MC(Meningitis; Gender=F, BloodTest=t) = Severe
It is able to "answer queries", i.e. classify new unlabeled tuples:
given "Q=?, given E=e", return "q".
MC-Learner's I/O
Input: a set of "queries" (labeled, partially specified tuples), cf. the input to standard (partial-data) learners:

  Query var Q   Evidence vars E
  Cancer = t    Gender=F, Age=35, Smoke=t
  Cancer = f    Gender=M, Age=25, Height=6'
  Menin  = t    Gender=F, Btest=t

Output: a MultiClassifier.
Error Measure
A "labeled query" is a pair [Q, E=e], q.
Query distribution: Prob([Q, E=e] asked); it can be uncorrelated with the "tuple distribution".
The MultiClassifier MC returns MC(Q, E=e) = q'.
Classification error of MC:
  CE(MC) = Σ_{[Q,E=e],q} Prob([Q, E=e] asked) · [| MC(Q, E=e) ≠ q |]
where [| a ≠ b |] = 1 if a ≠ b, 0 otherwise (the "0/1" error).
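A minimal runnable sketch of this 0/1 error measure; the multiclassifier, queries, and probabilities below are made-up toy values, not from the paper.

```python
# CE(MC) = sum over labeled queries of Prob(query asked) * [MC's answer != label].

def classification_error(mc, labeled_queries):
    """labeled_queries: list of (prob_asked, query_var, evidence, label)."""
    return sum(p * (mc(q, e) != label) for p, q, e, label in labeled_queries)

# Toy multiclassifier: hand-coded rules standing in for a learned one.
def toy_mc(query_var, evidence):
    if query_var == "Cancer":
        return evidence.get("Smoke") == "t"   # predict Cancer iff smoker
    return False                              # any other class variable: predict False

# (prob_asked, query variable Q, evidence E=e, label q)
queries = [
    (0.5, "Cancer", {"Gender": "F", "Age": 35, "Smoke": "t"}, True),
    (0.3, "Cancer", {"Gender": "M", "Age": 25, "Smoke": "f"}, False),
    (0.2, "Menin",  {"Gender": "F", "Btest": "t"}, True),
]
err = classification_error(toy_mc, queries)   # only the Menin query is missed
```

Note that the same classifier is scored on queries with different class variables, which is exactly what distinguishes CE(MC) from ordinary classification error.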
Learner's Task
Given a space of MultiClassifiers { MC_i } and a sample of labeled queries drawn from the "query distribution",
find MC* = argmin_{MC_i} { CE(MC_i) }, i.e. the classifier with minimal error over the query distribution.
Outline
  Motivation: handle multiple class variables
  Framework: formal model (Belief Nets, multi-classifier)
  Results:
    Theoretical analysis
    Algorithms (Likelihood vs Conditional Likelihood)
    Empirical comparison (1 structure vs k structures; LL vs LCL)
  Contributions
Simple Belief Net
Nodes H, B, J, with arcs H → B and H → J.
  P(H=1) = 0.05
  P(B=1 | H=h):  h=1: 0.95;  h=0: 0.03
  P(J=1 | H=h):  h=1: 0.8;   h=0: 0.3
Since P(J | H, B=0) = P(J | H, B=1) for all J, H, we have P(J | H, B) = P(J | H):
J is INDEPENDENT of B once we know H, so no B → J arc is needed!
Example of a Belief Net
Nodes H, B, J, with arcs H → B, H → J (and, initially, B → J).
Node ~ variable; link ~ "causal dependency"; "CPTable" ~ P(child | parents).
  P(H=1) = 0.05, P(H=0) = 0.95
  P(B=1 | H=h): h=1: 0.95; h=0: 0.03  (so P(B=0 | H=1) = 0.05, P(B=0 | H=0) = 0.97)

  h b  P(J=1 | h, b)  P(J=0 | h, b)
  1 1  0.8            0.2
  1 0  0.8            0.2
  0 1  0.3            0.7
  0 0  0.3            0.7
Encoding Causal Links (cont'd)
H → B, H → J, B → J, with
  P(H=1) = 0.05
  P(B=1 | H=h): h=1: 0.95; h=0: 0.03

  h b  P(J=1 | h, b)
  1 1  0.8
  1 0  0.8
  0 1  0.3
  0 0  0.3

Since P(J | H, B=0) = P(J | H, B=1) for all J, H, we have P(J | H, B) = P(J | H):
J is INDEPENDENT of B once we know H, so the B → J arc is not needed!
Include Only Causal Links
Sufficient belief net: H → B, H → J.
Requires only P(H=1), P(B=1 | H=h), and P(J=1 | H=h) known (only 5 parameters, not 7):
  P(H=1) = 0.05
  P(B=1 | H=h): h=1: 0.95; h=0: 0.03
  P(J=1 | H=h): h=1: 0.8;  h=0: 0.3
Hence:
  P(H=1 | J=0, B=1) ∝ P(H=1) · P(J=0 | H=1) · P(B=1 | J=0, H=1)
                    = P(H=1) · P(J=0 | H=1) · P(B=1 | H=1)
since B is independent of J given H.
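The "Hence" computation can be checked by brute-force enumeration; the CPT numbers are the ones from this toy H → B, H → J net, and the code is an illustrative sketch, not the authors' implementation.

```python
# Enumeration over the 3-node net H -> B, H -> J, computing P(H=1 | J=0, B=1)
# by summing joint terms P(h, b, j) = P(h) * P(b|h) * P(j|h).

P_H1 = 0.05
P_B1_given_H = {1: 0.95, 0: 0.03}
P_J1_given_H = {1: 0.8, 0: 0.3}

def joint(h, b, j):
    """P(H=h, B=b, J=j) for this structure."""
    ph = P_H1 if h == 1 else 1 - P_H1
    pb = P_B1_given_H[h] if b == 1 else 1 - P_B1_given_H[h]
    pj = P_J1_given_H[h] if j == 1 else 1 - P_J1_given_H[h]
    return ph * pb * pj

# P(H=1 | J=0, B=1) = P(H=1, B=1, J=0) / sum_h P(H=h, B=1, J=0)
num = joint(1, 1, 0)                       # 0.05 * 0.95 * 0.2
den = joint(1, 1, 0) + joint(0, 1, 0)      # + 0.95 * 0.03 * 0.7
posterior = num / den
```

With these numbers the posterior comes out to roughly 0.32: even with J=0, the positive B test raises P(H=1) well above its 0.05 prior.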
Belief Net as (Multi)Classifier
For a query [Q, E=e], the BN returns the distribution
  P_BN(Q=q1 | E=e), P_BN(Q=q2 | E=e), …, P_BN(Q=qm | E=e)
[Figure: bar chart of this distribution over q1 … qm]
and the (multi)classifier answers
  MC_BN(Q, E=e) = argmax_{qi} { P_BN(Q=qi | E=e) }.
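A sketch of using a belief net as a MultiClassifier, answering argmax_q P_BN(Q=q | E=e) for different class variables Q over the same toy three-node net; brute-force enumeration stands in for real BN inference, and all names are illustrative.

```python
# Toy H -> B, H -> J net as a MultiClassifier: any of H, B, J can be the query.
from itertools import product

P = {"H": {(): {1: 0.05, 0: 0.95}},
     "B": {(1,): {1: 0.95, 0: 0.05}, (0,): {1: 0.03, 0: 0.97}},
     "J": {(1,): {1: 0.8, 0: 0.2}, (0,): {1: 0.3, 0: 0.7}}}

def joint(h, b, j):
    return P["H"][()][h] * P["B"][(h,)][b] * P["J"][(h,)][j]

def mc_bn(Q, evidence):
    """Return argmax_q P(Q=q | evidence) by brute-force enumeration."""
    score = {0: 0.0, 1: 0.0}
    for h, b, j in product([0, 1], repeat=3):
        world = {"H": h, "B": b, "J": j}
        if all(world[v] == val for v, val in evidence.items()):
            score[world[Q]] += joint(h, b, j)
    return max(score, key=score.get)

ans_h = mc_bn("H", {"J": 0, "B": 1})   # classify H from evidence on J, B
ans_j = mc_bn("J", {"H": 1})           # same net, different class variable
```

The point of the example is the second call: one net answers queries about different class variables, which a standard one-class classifier cannot do.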
Learning Belief Nets
Belief Net = ⟨G, Θ⟩:
  G = directed acyclic graph (the "structure": what is related to what)
  Θ = the "parameters" (CPtables): strength of the connections
Learning a belief net ⟨G, Θ⟩ from "data":
  1. Learn the structure G
  2. Find the parameters Θ that are best for G
Our focus: #2 (parameters); "best" ≡ minimal CE error.
Learning a BN Multi-Classifier
Input: structure G + labeled queries (as in the training-data table earlier).
Goal: find the CPtables Θ that minimize the CE error:
  Θ* = argmin_Θ Σ_{[Q,E=e],q} Prob([Q, E=e] asked) · [| MC_{G,Θ}(Q, E=e) ≠ q |]
Issues
Q1: How many labeled queries are required?
Q2: How hard is learning, given distributional information?
Q3: What is the best algorithm for learning a
  … Belief Net?
  … Belief Net Classifier?
  … Belief Net MultiClassifier?
Q1, Q2: Theoretical Results
Sample complexity: PAC(ε, δ)-learn CPtables. Given a BN structure with N variables and K CPtable entries, for any ε, δ > 0, to find CPtables whose CE error is, with probability 1−δ, within ε of optimal, a sample of
  M(ε, δ) = O( (K²/ε²) · ln(K N / δ) )
labeled queries suffices.
Computational complexity: it is NP-hard to find the CPtables with minimal CE error (even to within any additive O(1/N)) from labeled queries… even given the correct structure!
Use Conditional Likelihood
Goal: minimize the "classification error", based on a training sample { [Qi, Ei=ei], qi* }.
The sample typically includes only high-probability queries [Q, E=e] and the most likely answers to these queries:
  q* = argmax_q { P(Q=q | E=e) }
So instead maximize the conditional likelihood
  LCL_D(Θ) = Σ_{[q*,e] ∈ D} log P_Θ(Q=q* | E=e)
As exact optimization is NP-hard, use gradient descent.
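A toy illustration of LCL_D(Θ) = Σ log P_Θ(Q=q* | E=e) on the same three-node H → B, H → J net; the labeled queries in D are invented for the example.

```python
# Conditional log-likelihood of a fixed parameter setting on toy labeled queries.
import math
from itertools import product

P_H1 = 0.05
P_B1 = {1: 0.95, 0: 0.03}
P_J1 = {1: 0.8, 0: 0.3}

def joint(h, b, j):
    ph = P_H1 if h == 1 else 1 - P_H1
    pb = P_B1[h] if b == 1 else 1 - P_B1[h]
    pj = P_J1[h] if j == 1 else 1 - P_J1[h]
    return ph * pb * pj

def cond_prob(qvar, qval, evidence):
    """P(Q=qval | evidence) by enumeration."""
    num = den = 0.0
    for h, b, j in product([0, 1], repeat=3):
        w = {"H": h, "B": b, "J": j}
        if all(w[k] == v for k, v in evidence.items()):
            p = joint(h, b, j)
            den += p
            if w[qvar] == qval:
                num += p
    return num / den

# Sample D of labeled queries [Q, E=e] -> q* (toy values):
D = [("J", 1, {"H": 1}), ("H", 0, {"B": 0})]
lcl = sum(math.log(cond_prob(q, v, e)) for q, v, e in D)
```

Unlike the ordinary log-likelihood, only the conditional terms log P(q* | e) are scored, so probability mass spent modeling the evidence distribution is neither rewarded nor penalized.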
Gradient Descent Algorithm: ILQ
How should the CPtable entry θ_{c|f} = B(C=c | F=f) change, given a datum "[Q=q, E=e]"?
[Figure: belief net with node C (parents F1, F2) and its CPtable P(C | f1, f2), plus query node Q and evidence nodes E]
Descend along the derivative:
  ∂ LCL(q|e) / ∂ θ_{c|f} = [ B(c, f | q, e) − B(c, f | e) ] / θ_{c|f}
then sum over the queries "[Q=q, E=e]", use conjugate gradient, …
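A finite-difference sanity check of the derivative ∂LCL(q|e)/∂θ_{c|f} = [B(c,f|q,e) − B(c,f|e)] / θ_{c|f} on the toy H → B, H → J net, with θ_{B=1|H=1} as the free parameter. This treats θ_{c|f} as an independent multilinear parameter (its complement P(B=0|H=1) held fixed), which is the view under which this gradient holds; it is an illustrative sketch, not ILQ itself.

```python
# Check the ILQ-style gradient against a central finite difference.
import math
from itertools import product

def make_joint(theta):                 # theta = theta_{B=1|H=1}
    pB = {(1, 1): theta, (0, 1): 0.05, (1, 0): 0.03, (0, 0): 0.97}  # (b, h)
    pJ = {1: 0.8, 0: 0.3}
    def joint(h, b, j):
        ph = 0.05 if h == 1 else 0.95
        pj = pJ[h] if j == 1 else 1 - pJ[h]
        return ph * pB[(b, h)] * pj
    return joint

def cond(joint, pred, ev):
    """P(pred holds | ev) by enumeration over all worlds."""
    num = den = 0.0
    for h, b, j in product([0, 1], repeat=3):
        w = {"H": h, "B": b, "J": j}
        if all(w[k] == v for k, v in ev.items()):
            p = joint(h, b, j)
            den += p
            if pred(w):
                num += p
    return num / den

theta = 0.95
cf = lambda w: w["B"] == 1 and w["H"] == 1        # the (c, f) event
j0 = make_joint(theta)
# Analytic gradient of log P(J=1 | B=1) w.r.t. theta_{B=1|H=1}:
grad = (cond(j0, cf, {"J": 1, "B": 1}) - cond(j0, cf, {"B": 1})) / theta
# Central finite difference of the same quantity:
eps = 1e-6
hi = math.log(cond(make_joint(theta + eps), lambda w: w["J"] == 1, {"B": 1}))
lo = math.log(cond(make_joint(theta - eps), lambda w: w["J"] == 1, {"B": 1}))
fd = (hi - lo) / (2 * eps)
```

Both routes give the same number, so the closed form lets a learner compute the full gradient from two inference calls per parameter instead of re-running inference for each perturbation.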
Better Algorithm: ILQ as Constrained Optimization
The constraints (θ_{c|f} ≥ 0, θ_{c=0|f} + θ_{c=1|f} = 1) are handled by a new parameterization:
  θ_{c|f} = e^{β_{c|f}} / Σ_{c'} e^{β_{c'|f}}
and, for each "row" r_j, set β_{c0|r_j} = 0 for one value c0.
The derivative becomes
  ∂ LCL(q|e) / ∂ β_{c|f} = [ B(c, f | q, e) − θ_{c|f} B(f | q, e) ] − [ B(c, f | e) − θ_{c|f} B(f | e) ]
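The reparameterization can be sketched directly; the β values below are arbitrary toy numbers.

```python
# theta_{c|f} = exp(beta_{c|f}) / sum_c' exp(beta_{c'|f}) turns the constrained
# problem (theta >= 0, each CPT row sums to 1) into an unconstrained one.
import math

def row_softmax(betas):
    z = sum(math.exp(b) for b in betas)
    return [math.exp(b) / z for b in betas]

betas = [1.2, -0.4, 0.0]          # unconstrained parameters for one CPT row
thetas = row_softmax(betas)       # automatically a valid probability row

# Shifting every beta in a row by a constant leaves theta unchanged, so one
# entry per row can be pinned to 0 (the slide's beta_{c0|r_j} = 0) without loss.
shifted = row_softmax([b - 1.2 for b in betas])
```

Gradient steps on β can then be unconstrained: any β vector maps back to a legal CPT row, with no projection step needed.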
Q3: How to Learn a BN MultiClassifier?
Approach 1: minimize error ≈ maximize conditional likelihood
  (In)complete data: ILQ
Approach 2: fit to data ≈ maximize likelihood
  Complete data: Observed Frequency Estimate (OFE)
  Incomplete data: EM / APN
Empirical Studies
Two different objectives, so two learning algorithms:
  Maximize conditional likelihood: ILQ
  Maximize likelihood: APN
Two different approaches to multiple classes:
  1 copy of the structure vs k copies of the structure (or k naïve-Bayes nets)
Several "datasets": Alarm, Insurance, …
Error measures: "0/1"; MSE(Θ) = Σ_i [P_true(qi|ei) − P_Θ(qi|ei)]²
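Both evaluation measures in one toy computation; the probabilities are invented for the example.

```python
# "0/1" error counts wrong argmax answers; MSE compares the predicted
# conditional probabilities themselves: MSE = sum_i (P_true(qi|ei) - P(qi|ei))^2.
p_true  = [0.9, 0.2, 0.6]   # made-up true conditional probabilities P_true(qi|ei)
p_model = [0.8, 0.1, 0.7]   # made-up model estimates P_theta(qi|ei)

mse = sum((t - m) ** 2 for t, m in zip(p_true, p_model))
# For binary queries, the argmax answer flips at 0.5, so 0/1 error counts
# the queries where the two distributions fall on opposite sides of 0.5:
zero_one = sum((t >= 0.5) != (m >= 0.5) for t, m in zip(p_true, p_model))
```

Here every estimate is off by 0.1, giving a nonzero MSE, yet the 0/1 error is zero: the two measures can disagree on which learner is "better".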
1 vs k Structures
[Figure: the same training data over Cancer, Menin, Gender, Age, Smoke, Height, Btest used two ways: one shared structure answering both Cancer and Menin queries, versus one copy of the structure per class variable, each trained only on its own labeled rows]
Empirical Study I: Alarm
Alarm belief net: 37 variables, 46 links, 505 parameters.
Query distribution: [HC'91] says that typically 8 variables Q ⊆ N appear as query and 16 variables E ⊆ N appear as evidence.
Select Q ∈ Q uniformly; use the same set of 7 evidence variables E ⊆ E; assign the value e for E based on P_alarm(E=e); find the "value" v based on P_alarm(Q=v | E=e).
Each run uses m such queries, m = 5, 10, …, 100, …
Results (Alarm; ILQ; small sample): [plots of CE and MSE]
Results (Alarm; ILQ; large sample): [plots of CE and MSE]
Comments on Alarm Results
For small sample sizes, "ILQ, 1 structure" is better than "ILQ, k structures".
For large sample sizes, "ILQ, 1 structure" ≈ "ILQ, k structures": ILQ-k has more parameters to fit, but there is lots of data.
APN is OK, but much slower (it did not converge within the time bounds).
Empirical Study II: Insurance
Insurance belief net (simplified version): 27 variables (3 query, 8 evidence), 560 parameters.
Distribution: select 1 query variable randomly from the 3; use all 8 evidence variables.
Results (Insurance; ILQ): [plots of CE and MSE]
Summary of Results
Learning Θ for a given structure, to minimize CE_D(Θ) or MSE_D(Θ):
Correct structure:
  Small number of samples: ILQ-1 (and APN-1) win (over ILQ-k, APN-k)
  Large number of samples: ILQ-k ≈ ILQ-1 win (over APN-1, APN-k)
Incorrect structure (naïve-Bayes): ILQ wins.
Future Work
Best algorithm for learning an optimal BN?
  Actually optimize the CE error (not LCL); learn the STRUCTURE as well as the CPtables; special cases where ILQ is efficient (complete data?).
Other "learning environments": other prior knowledge; query forms; explicitly labeled queries.
Better understanding of the sample complexity, without the "…" restriction.
Related Work
Like (ML) classification, but with probabilities rather than discrete labels, and different class variables / different evidence sets… see Caruana.
"Learning to Reason" [KR'95]: "do well on the tasks that will be encountered"… but a different performance system.
Sample complexity [FY, Hoeffgen]: a different learning model.
Computational complexity [Kilian/Naor 95]: NP-hard to find ANY distribution with minimal L1 error w.r.t. unconditional queries; here, conditional queries against a BN.
Take-Home Messages
To maximize performance: use conditional likelihood (ILQ), not likelihood (APN/EM, OFE), especially if the structure is wrong, the sample small, … (… controversial…)
To deal with MultiClassifiers: use 1 structure, not k. With a small sample, 1 structure gives better performance; with a large sample, the same performance, but 1 structure is smaller. (… yes, of course…)
Relation to "attribute vs relation": not "1 example for many classes of queries", but "1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE".
Contributions
An appropriate model for learning: extends standard learning environments to labeled queries with different class variables.
Sample complexity: only "few" labeled queries are needed.
Computational complexity: NP-hard, so the effective algorithm is gradient descent.
Empirical evidence: works well! http://www.cs.ualberta.ca/~greiner/BN-results.html
So we can learn a MultiClassifier that works well in practice.
Questions?
LCL vs LL: does the difference matter? ILQ vs APN. Query Forms.
See also http://www.cs.ualberta.ca/~greiner/BN-results.html
Learning Model
Most belief-net learners try to maximize the LIKELIHOOD
  LL_D(Θ) = Σ_{x ∈ D} log P_Θ(x)
… as their goal is to "fit the data" D.
Our goal is different: we want to minimize the error, over the distribution of queries.
If we are never asked "What is p(jaun | btest−)?", we don't care if BN(jaun | btest−) ≠ p(jaun | btest−).
Different Optimization
  LL_D(Θ) = Σ_{[q*,e] ∈ D} log P(Q=q* | E=e) + Σ_{[q*,e] ∈ D} log P(E=e)
          = LCL_D(Θ) + Σ_{[q*,e] ∈ D} log P(E=e)
As the Σ_{[q*,e] ∈ D} log P(E=e) term is non-trivial,
  Θ_LL = argmax_Θ { LL_D(Θ) }  ≠  Θ_LCL = argmax_Θ { LCL_D(Θ) }
Cf. discriminant analysis: maximize overall likelihood vs minimize predictive error.
Finding Θ_LCL is NP-hard, so… ILQ.
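The decomposition LL = LCL + Σ log P(E=e) can be verified numerically on the toy H → B, H → J net from the earlier slides; the query/evidence pair is arbitrary.

```python
# Check log P(q, e) = log P(q | e) + log P(e) by enumeration.
import math
from itertools import product

P_H1, P_B1, P_J1 = 0.05, {1: 0.95, 0: 0.03}, {1: 0.8, 0: 0.3}

def joint(h, b, j):
    return ((P_H1 if h == 1 else 1 - P_H1)
            * (P_B1[h] if b == 1 else 1 - P_B1[h])
            * (P_J1[h] if j == 1 else 1 - P_J1[h]))

def prob(ev):
    """Marginal probability of a partial assignment ev."""
    return sum(joint(h, b, j) for h, b, j in product([0, 1], repeat=3)
               if all({"H": h, "B": b, "J": j}[k] == v for k, v in ev.items()))

q_and_e = {"J": 1, "B": 1}        # query J=1 with evidence B=1
e_only  = {"B": 1}
ll  = math.log(prob(q_and_e))                   # log P(q, e), the LL term
lcl = math.log(prob(q_and_e) / prob(e_only))    # log P(q | e), the LCL term
```

Since log P(e) < 0, the two objectives differ by exactly the evidence term, which is why their maximizers can differ.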
Why an Alternative Model?
A belief net is both a representation of a distribution and a system for answering queries.
Suppose the BN must answer "What is p(hep | jaun, btest−)?", but never "What is p(jaun | btest−)?".
Then the BN is good if BN(hep | jaun, btest−) = p(hep | jaun, btest−), even if BN(jaun | btest−) ≠ p(jaun | btest−).
Query Distribution vs Tuple Distribution
Distribution over tuples, p(·): e.g. p(hep, jaun, btest−, …) = 0.07; p(flu, cough, ¬headache, …) = 0.43.
Distribution over queries, sq(q) = Prob(q asked): e.g. ask "What is p(hep | jaun, btest−)?" 30% of the time; ask "What is p(flu | cough, ¬headache)?" 22% of the time.
They can be uncorrelated. E.g. Prob[asking about Cancer] = sq("cancer") = 100%, even if Pr[Cancer] = p(cancer) = 0.
Query Distribution ≠ Tuple Distribution
Suppose a GP asks "Pregnant?" for all ADULT FEMALE patients:

  Pregnant Adult Gender
  +        +     F
  -        +     F
  +        +     F

The data give P(Preg | Adult, Gender=F) = 2/3. Is this really the TUPLE distribution? Is P(Gender=F) = 1?
NO: it only reflects the questions asked! It provides information about P(preg | Adult=+, Gender=F), but NOT about P(Adult), …
Query Probability ≠ Tuple Probability
The probability Prob([Q, E=e] asked) of a labeled query [Q, E=e], q* is independent of the tuple probability. (Note: the value q* of the query IS based on P(Q=q | E=e).)
≠ P(Q=q, E=e): one could always ask about a 0-probability situation, e.g. always ask "[Pregnant=t, Gender=Male]": sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0.
≠ P(E=e): sq(Q, E=ei) need not track P(E=ei), e.g. P(Gender=Female) = P(Gender=Male) even when sq(Pregnant, Gender=Female) ≠ sq(Pregnant, Gender=Male).
Does it matter?
If all queries involve the same query variable, it is OK to pretend sq(·) ~ p(·), since no one ever asks about the EVIDENCE DISTRIBUTION.
E.g. in the Pregnant/Adult/Gender data above, since no one asks "What is P(Gender)?", it doesn't matter.
But it is problematic in a MultiClassifier, if there are other queries, e.g. sq(Gender; ·).
ILQ (conditional likelihood) vs APN (likelihood)
Wrong structure: ILQ better than APN/EM. Experiments: artificial data; naïve Bayes on UCI datasets.
Correct structure: ILQ often better than OFE, APN/EM. Experiments follow.
Discriminant analysis: Maximize Overall Likelihood vs Minimize Predictive Error
Wrong Structure I
The mth target distribution is a "TAN" with E1 → E2 → … → Em; the learner uses a (wrong) naïve-Bayes structure.
[Figure: the TAN target structures vs the naïve-Bayes structure used]
Results (k=5, m=0..4), wrong structure: [plots of CE and MSE]
Wrong Structure II
Learn naïve Bayes for REAL-world datasets: Chess (also FLARE, DNA).
[Plots of CE and MSE]
Correct Structure
If the structure is correct, ILQ and OFE / (APN, EM) should all converge to the optimum. Which is more efficient? Depends…
"Correct" structure: fill in the parameters of the CORRECT STRUCTURE for REAL-world datasets: Chess (also FLARE, DNA); the structure was learned using PowerConstructor.
[Plot of CE]
Summary of Results

  Dataset  ILQ     OFE / EM / APN
  Flare    0.1756  0.198
  DNA      0.0489  0.0557
  Chess    0.0558  0.1423
  Vote     0.0345  0.1057

MSE results; compared vs OFE if the data are complete, vs APN/EM if incomplete.
Query Forms
The MD asks "What is P(D | A, B)?" 20% of the time: sq(D=t | A, B) = 0.2, where sq(D=t; A, B) = Σ_i sq(D=t; A=ai, B=bi).
Challenge #1: the subdistribution sq(D; A=ai, B=bi) = sq(D; A, B) · Prob(A=ai, B=bi | "asked a D|A,B question"). Perhaps it is uniform? Or = P(A=ai, B=bi)? NO! sq(Pregnant | Gender) = 1.0 does not imply sq(Preg | Gend=M) = sq(Preg | Gend=F).
Challenge #2: this needs 2^k labels! … but in the UQT model, perhaps not needed…
In UQT, the network may need to SHRINK! … but query FORMS may be sufficient!
Efficiency of ILQ
For each query p(q|e) and each θ_{c|f}:
if (q, e) is d-separated from (c, f), then ∂ LCL(q|e) / ∂ θ_{c|f} = 0, so skip it! This saves 10-90% of the work.
Current timing (PIII-500): ALARM: 100 ms/query (each iteration); INSURANCE: 30 ms/query (each iteration).
Recall the derivative used:
  ∂ LCL(q|e) / ∂ β_{c|f} = [ B(c, f | q, e) − θ_{c|f} B(f | q, e) ] − [ B(c, f | e) − θ_{c|f} B(f | e) ]
Results (Alarm; ILQ-1/APN-1; large sample): [plots of CE and MSE]