THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY

CSIT 5220:  Reasoning and Decision under Uncertainty

L10: Model-Based Classification and Clustering

Nevin L. Zhang

Room 3504, phone: 2358-7015

Email: lzhang@cs.ust.hk

Probabilistic Models (PMs) for Classification

PMs for Clustering

Classification

The problem:

Given data:

Find a mapping (A1, A2, …, An) → C

Possible solutions:

Artificial neural network (ANN)

Decision tree (Quinlan)

(SVM: continuous data)

Probabilistic Approach to Classification

Will Boss Play Tennis?

Bayesian Networks for Classification

The Naïve Bayes model often performs well in practice.

Drawback of Naïve Bayes: it assumes that the attributes are mutually independent given the class variable.

This assumption is often violated, which leads to double counting of evidence.

Fixes:

General BN classifiers

Tree-augmented Naïve Bayes (TAN) models
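To make the Naïve Bayes assumption concrete, here is a minimal Python sketch of a Naïve Bayes classifier for discrete attributes. The function names, the Laplace smoothing parameter `alpha`, and the toy data are assumptions made for illustration; the data is made up and is not the lecture's tennis table.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Collect the counts needed for P(C) and P(A_i | C) from discrete data."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[i][c][v] = number of rows with class c and A_i = v
    value_counts = [defaultdict(Counter) for _ in range(n_attrs)]
    domains = [set() for _ in range(n_attrs)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha

def posterior(model, row):
    """Return P(C = c | A_1 = a_1, ..., A_n = a_n) for every class c."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior P(C = c)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of P(A_i = v | C = c)
            p *= (value_counts[i][c][v] + alpha) / (n_c + alpha * len(domains[i]))
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def classify(model, row):
    """Pick the class with the highest posterior probability."""
    post = posterior(model, row)
    return max(post, key=post.get)

# Made-up toy data: (outlook, wind) -> play. Hypothetical, for illustration only.
rows = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"),
        ("rain", "strong"), ("overcast", "weak"), ("overcast", "strong")]
labels = ["no", "no", "yes", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(classify(model, ("overcast", "weak")))
```

Because each attribute contributes its own factor, two strongly correlated attributes would each multiply in nearly the same evidence, which is the double counting mentioned above.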

Bayesian Networks for Classification

General BN classifier

Treat the class variable as just another variable.

Learn a BN.

Classify the next instance based on the values of the variables in the Markov blanket of the class variable.

Often performs poorly: only the attributes in the Markov boundary of the class variable influence the prediction, so the information in the remaining attributes goes unused.
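As a small illustration of the Markov blanket idea, the following Python sketch collects the parents, children, and the children's other parents of a target node in a DAG. The function name, the dictionary representation, and the example structure are hypothetical; they are not taken from the lecture.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG given as {node: set of parents}."""
    # Children: nodes that list the target among their parents.
    children = {n for n, ps in parents.items() if target in ps}
    # Spouses: the other parents of the target's children.
    spouses = set()
    for child in children:
        spouses |= set(parents[child])
    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)
    return blanket

# Hypothetical learned structure: A1 -> C -> A2 <- A3, and A3 -> A4.
parents = {"C": {"A1"}, "A1": set(), "A2": {"C", "A3"},
           "A3": set(), "A4": {"A3"}}
print(sorted(markov_blanket(parents, "C")))
```

In this made-up structure A4 lies outside the blanket of C, so its observed value would play no role when classifying with the learned network.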

Bayesian Networks for Classification

Tree-Augmented Naïve Bayes (TAN) model

Captures dependence among attributes using a tree structure.

Learning:

First learn a tree among the attributes using the Chow-Liu algorithm. This is a special structure learning problem and is easy.

Then add the class variable as a parent of every attribute and estimate the parameters.

Classification:

arg max_c P(C=c | A1=a1, …, An=an), computed by BN inference.

Many other methods exist.
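For reference, the TAN structure implies the following factorization and classification rule. The notation is an assumption: \mathrm{pa}(i) denotes the index of the tree parent of attribute A_i, and the factor for the root attribute conditions on C only.

```latex
P(C, A_1, \dots, A_n) = P(C) \prod_{i=1}^{n} P\bigl(A_i \mid A_{\mathrm{pa}(i)}, C\bigr)

c^{*} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P\bigl(A_i = a_i \mid A_{\mathrm{pa}(i)} = a_{\mathrm{pa}(i)}, C = c\bigr)
```

Since all attributes are observed at classification time, the posterior over C is proportional to this product, so the arg max can be evaluated by plugging in the observed values for each class.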

Chow-Liu Trees

Task: Find a tree model over the observed variables that has maximum likelihood given the data.

Maximized log-likelihood
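A standard way to write the maximized log-likelihood of a tree model T, given here in assumed notation (N data cases D, empirical mutual information I, empirical entropy H), is:

```latex
\max_{\theta} \log P(D \mid T, \theta)
  = N \sum_{(X_i, X_j) \in T} I(X_i; X_j) \; - \; N \sum_{i} H(X_i)
```

The entropy term does not depend on the tree, so maximizing the likelihood over trees amounts to maximizing the sum of mutual information over the tree's edges.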

Mutual Information

Chow-Liu Trees

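The task is equivalent to finding a maximum spanning tree of a weighted, undirected graph: the complete graph over the observed variables, in which the edge between Xi and Xj is weighted by their empirical mutual information.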
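The edge weights of such a graph can be computed directly from data. The Python sketch below estimates empirical mutual information for every pair of discrete variables; the function names and the toy columns are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in nats from two aligned columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) )
    return sum((nxy / n) * log(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in pxy.items())

def mi_graph(columns):
    """Weighted complete graph: one (weight, i, j) edge per pair of variables."""
    return [(mutual_information(columns[i], columns[j]), i, j)
            for i, j in combinations(range(len(columns)), 2)]

# Made-up data: variable 1 copies variable 0, variable 2 is unrelated to both.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
for weight, i, j in mi_graph(columns):
    print(i, j, round(weight, 3))
```

The resulting edge list can be handed to any maximum spanning tree routine.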

Maximum Spanning Trees

http://www.cs.cmu.edu/~guestrin/Class/15781/recitations/r10/11152007chowliu.pdf

Illustration of Kruskal’s Algorithm
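Kruskal's algorithm for a maximum spanning tree can be sketched in a few lines of Python: consider edges from heaviest to lightest and keep an edge whenever it joins two different components. The function name, the union-find helper, and the edge weights are hypothetical.

```python
def maximum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm for a maximum-weight spanning tree.

    edges: iterable of (weight, u, v) with nodes numbered 0..n_nodes-1.
    Returns the chosen edges as (u, v, weight) in the order they were added.
    """
    parent = list(range(n_nodes))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edge first
        ru, rv = find(u), find(v)
        if ru != rv:            # the edge joins two components: no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical mutual-information weights for four variables.
edges = [(0.30, 0, 1), (0.25, 1, 2), (0.20, 0, 2), (0.10, 2, 3), (0.05, 1, 3)]
tree = maximum_spanning_tree(4, edges)
print(tree)
```

With these made-up weights, the edge between variables 0 and 2 is skipped because 0 and 2 are already connected through variable 1.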

L10: Probabilistic Models (PMs) for Classification and Clustering

Probabilistic Models (PMs) for Classification

PMs for Clustering

A Medical Application

In medical diagnosis, a gold standard sometimes exists.

Example: Lung cancer

Symptoms: persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc.

Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum.

Gold standard: biopsy, the removal of a small sample of tissue for examination under a microscope by a pathologist.

A Medical Application

Sometimes a gold standard does not exist.

Example: Rheumatoid arthritis (RA)

Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc.

Information for diagnosis: symptoms, medical history, physical exam, and lab tests, including a test for rheumatoid factor.

(Rheumatoid factor is an antibody found in the blood of about 80 percent of adults with RA.)

No gold standard:

None of the symptoms or their combinations are clear-cut indicators of RA.

The presence or absence of rheumatoid factor does not by itself establish whether one has RA.

Latent Class (LC) Analysis of Hannover Rheumatoid Arthritis Data

Class-specific probabilities

Cluster 1: “disease” free

Cluster 2: “back-pain type”

Cluster 3: “joint type”

Cluster 4: “severe type”
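To show how class-specific probabilities can be used to assign a patient to a cluster, here is a minimal Python sketch of the posterior computation in a latent class model with binary symptoms. The function name and all numbers are made up for illustration; they are not estimates from the Hannover data.

```python
def cluster_posterior(priors, class_probs, symptoms):
    """P(cluster | symptoms) under a latent class model with binary symptoms.

    priors[k]         : P(Z = k)
    class_probs[k][i] : P(symptom i present | Z = k)
    symptoms[i]       : 1 if symptom i is present, else 0
    """
    scores = []
    for prior, probs in zip(priors, class_probs):
        p = prior
        # Symptoms are assumed independent given the latent class.
        for present, q in zip(symptoms, probs):
            p *= q if present else (1.0 - q)
        scores.append(p)
    z = sum(scores)
    return [s / z for s in scores]

# Hypothetical parameters for three clusters and two symptoms
# (back pain, joint swelling).
priors = [0.5, 0.3, 0.2]
class_probs = [[0.1, 0.05], [0.8, 0.1], [0.3, 0.9]]
post = cluster_posterior(priors, class_probs, [1, 0])
print([round(p, 3) for p in post])
```

With these assumed parameters, a patient with back pain but no joint swelling is most likely to belong to the second cluster.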