
Page 1: Lecture 4

Data Mining UMUC CSMN 667
Lecture #4
By Dr. Borne, 2005

Page 2: Lecture 4

Case Analysis Paper - Reminder

• Please e-mail me your suggested topic (the application area to be researched) so that I may verify that it is okay.

Page 3: Lecture 4

Clarification on Case Analysis Term Paper

• Your term paper should be a research paper on your selected topic: how has data mining been used in that application area?

• You should not carry out a data mining exercise yourself with those data. Rather, you should report on the findings of other organizations that have worked in that area.

Page 4: Lecture 4

Math, what math?

• Should you pay detailed attention to the math in the textbook? … Yes and No …
  – Yes, if I have discussed it in these Lectures, in the WebTycho discussions, or in the Lab Exercises … but you do not need to know the details of the equation derivations or proofs.
  – No, if you have not seen it in these Lectures, in WebTycho discussions, or in lab exercises.

Page 5: Lecture 4

Lecture 4: "Classification - Part 1"

age     income   student   credit_rating
<=30    high     no        fair
<=30    high     no        excellent
31…40   high     no        fair
>40     medium   no        fair
>40     low      yes       fair
>40     low      yes       excellent
31…40   low      yes       excellent
<=30    medium   no        fair
<=30    low      yes       fair
>40     medium   yes       fair
<=30    medium   yes       excellent
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent

Page 6: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 7: Lecture 4

Data Mining: From Applications to Algorithms
(perhaps useful in preparing your Case Analysis Term Paper)

Page 8: Lecture 4

Introduction to Classification Applications

• Classification = learning a function that classifies the data into a set of predefined classes.
  – predicts categorical class labels (i.e., discrete labels)
  – classifies data (constructs a model) based on the training set and on the values (class labels) in a classifying attribute, and then uses the model to classify new database entries.

Example: A bank might want to learn a function that determines whether a customer should get a loan or not. Decision trees and Bayesian classifiers are examples of classification algorithms. This application is called Credit Scoring.

Other applications: credit approval; target marketing; medical diagnosis; outcome (e.g., treatment) analysis.

Page 9: Lecture 4

Classification - a 2-Step Process

• Model Construction (Description): describing a set of predetermined classes = Build the Model.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction = the training set.
  – The model is represented by classification rules, decision trees, or mathematical formulae.

• Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values = Apply the Model.
  – It is important to estimate the accuracy of the model:
    • The known label of each test sample is compared with the classified result from the model.
    • The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
    • The test set is chosen completely independently of the training set, otherwise over-fitting will occur.
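A minimal sketch of the two steps (our own illustration, using scikit-learn and a stand-in dataset; the lecture does not prescribe any particular library):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)   # stand-in for any labeled training data

    # Step 1 -- Model Construction: build the model from the training set only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2 -- Model Usage: classify the independent test set and estimate accuracy
    y_pred = model.predict(X_test)
    print("accuracy rate:", accuracy_score(y_test, y_pred))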

Page 10: Lecture 4


When to use Classification Applications?

• If you do not know the types of objects stored in your database, then you should begin with a Clustering algorithm, to find the various clusters (classes) of objects within the DB. This is Unsupervised Learning.

• If you already know the classes of objects in your database, then you should apply Classification algorithms, to classify all remaining (or newly added) objects in the database using the known objects as a training set. This is Supervised Learning.

• If you are still learning about the properties of known objects in the database, then this is Semi-Supervised Learning, which may involve Neural Network techniques.

Page 11: Lecture 4

Supervised vs. Unsupervised Learning

• Unsupervised learning (clustering)
  – The class labels of the training data are unknown.
  – Start with a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters within the data.

• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
  – New data are classified based on the training set.

Page 12: Lecture 4

Sample Classification Problem: (1) Build the Model (Descriptive)

Training Data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The Classification Algorithm produces the Classifier (Model = rules):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Page 13: Lecture 4

Sample Classification Problem: (2) Apply the Model (Predictive)

Apply the Classifier (rules) to the Test Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) -- Tenured? The rule answers 'yes'.

Not Perfect: note that the rule misclassifies Merlisa (it predicts 'yes', but her actual label is 'no').
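A minimal sketch of applying the slide-12 rule to this test set (the Python encoding is ours; the rule and data are from the slides):

    test_data = [
        ("Tom",     "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George",  "Professor",      5, "yes"),
        ("Joseph",  "Assistant Prof", 7, "yes"),
    ]

    def predict(rank, years):
        # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
        return "yes" if rank == "Professor" or years > 6 else "no"

    correct = sum(predict(rank, years) == tenured
                  for _, rank, years, tenured in test_data)
    print(f"accuracy rate: {correct}/{len(test_data)}")   # 3/4: Merlisa is misclassified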

Page 14: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 15: Lecture 4

True Funny Story from the Workplace*

• "Discrepancy!" -- The database administrator (DBA) runs the quarterly sales report as usual. But the assistant to the CEO demands that the DBA run it again. Why? Because … "The actual sales numbers don't match the projected numbers!"

(*copyright: 2003 Computerworld, Inc.)

Page 16: Lecture 4

Issues in Classification - 1

• Data Preparation:
  – Data cleaning
    • Preprocess the data in order to reduce noise and handle missing values.
  – Relevance analysis (feature selection)
    • The "interestingness problem"
    • Remove irrelevant or redundant attributes.
  – Data transformation
    • Generalize and/or normalize the data (set values on a 0 to 1 scale).
    • Data are made categorical (sometimes called "nominal", i.e., discrete).
    • If the data are continuous-valued, they should be discretized.

(continued)

Page 17: Lecture 4

Issues in Classification - 2

• Handling different data types:
  – Continuous:
    • Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
  – Discrete:
    • Binary (0 or 1; Yes/No; Male/Female)
    • Boolean (True/False)
    • Specific list of allowed values (e.g., zip codes; country names)
  – Categorical:
    • Non-numeric (character/text data) (e.g., people's names)
    • Can be Ordinal (ordered) or Nominal (not ordered)
    • Reference: http://www.twocrows.com/glossary.htm#anchor311516

• Examples of Classification Techniques:
  – Regression for continuous numeric data
  – Logistic Regression for discrete data
  – Bayesian Classification for categorical data

Page 18: Lecture 4

Issues in Classification - 3

• Robustness:
  – handling noise and missing values
• Speed and scalability of the model:
  – time to construct the model
  – time to use the model
• Scalability of implementation:
  – ability to handle ever-growing databases
• Interpretability:
  – understanding and insight provided by the model
• Goodness of rules:
  – decision tree size
  – compactness of classification rules
• Predictive accuracy

Page 19: Lecture 4

Issues in Classification - 4

• Overfitting
  – Definition: if your classifier (machine learning model) fits noise (i.e., pays attention to parts of the data that are irrelevant), then it is overfitting.

[Diagram: a good fit ("GOOD") contrasted with an overfit boundary ("BAD"). This diagram will be explained in next week's lecture slides.]

Page 20: Lecture 4

ASSESSMENT: How good is a classification algorithm?

Different schemes for characterizing performance (taken from Figure 4.3 in the Dunham textbook):

Classification view:

              Assigned Class A   Assigned Class B
Is Class A    True Positive      False Negative
Is Class B    False Positive     True Negative

Information Retrieval view (applies to Information Retrieval algorithms):

              Retrieved          Not Retrieved
Relevant      Good               Bad
Not Relevant  Bad                Good
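A small sketch (our own, with made-up counts) of turning the four cells of this table into the usual performance measures:

    def performance(tp, fn, fp, tn):
        """Derive standard measures from the 2x2 assessment table."""
        accuracy  = (tp + tn) / (tp + fn + fp + tn)   # overall correctness
        precision = tp / (tp + fp)   # of those assigned Class A, how many are Class A
        recall    = tp / (tp + fn)   # of the true Class A items, how many were found
        return accuracy, precision, recall

    # Illustrative counts only
    print(performance(tp=50, fn=10, fp=5, tn=35))   # (0.85, 0.909..., 0.833...)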

Page 21: Lecture 4

Classification Performance: Person Height Database Example

[Pie chart of classification outcomes:]
True Positive: 20%
True Negative: 25%
False Positive: 45%
False Negative: 10%

Page 22: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 23: Lecture 4

Statistical Algorithms

• Regression -- a classifier (such as a decision tree) that predicts values of continuous variables.

• Logistic Regression -- a classifier that predicts boolean (Yes/No, True/False, 0/1) values (discrete estimators).

• Bayesian Classification -- a probabilistic classification approach, based on prior assumptions about the distribution of model parameters, that predicts categorical values.

Page 24: Lecture 4


(from last week): Regression

• Regression is a predictive technique that discovers relationships between input and output patterns, where the values are continuous or real valued.

• Many traditional statistical regression models are linear.

• Neural networks, though biologically inspired, are in fact non-linear regression models.

• Non-linear relationships occur in many multi-dimensional data mining applications.
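A small sketch of a traditional linear regression (our own illustration with made-up values, using NumPy's least-squares polynomial fit):

    import numpy as np

    # Made-up input/output patterns with a roughly linear relationship
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

    # Least-squares fit of the linear model y = a*x + b
    a, b = np.polyfit(x, y, deg=1)
    print(f"y = {a:.2f}*x + {b:.2f}")   # a learned continuous-valued predictor

    # Predict an output for a new input pattern
    print("prediction at x=6:", a * 6 + b)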

Page 25: Lecture 4

An Example of a Regression Model

[Figure: Profit (y-axis) vs. Years Early Loan Paid Off, from 0 to 7 (x-axis), comparing a linear model with a non-linear model.]

Page 26: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 27: Lecture 4


Bayesian Classification

We “discussed” Bayes Theorem last time…

Page 28: Lecture 4

Bayesian Classifiers

• Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X), which states:

  posterior = (likelihood x prior) / evidence

• P(C) = prior probability = the probability that any given data sample is in class C, estimated before we have measured the sample data.

• We wish to determine the posterior probability P(C|X), which estimates whether C is the correct class for a given set of sample data X.

Page 29: Lecture 4

Estimating Bayesian Classifiers

• P(C|X) = P(X|C) P(C) / P(X) …
  – Estimate P(Cj) by counting the frequency of occurrence of each class Cj in the training data set.*
  – Estimate P(Xk) by counting the frequency of occurrence of each attribute value Xk in the data.*
  – Estimate P(Xk | Cj) by counting how often the attribute value Xk occurs in class Cj in the training data set.*
  – Calculate the desired end result P(Cj | Xk), which is the classification = the probability that Cj is the correct class for a data item having attribute Xk.

(*Estimating these probabilities can be computationally very expensive for very large data sets.)

Page 30: Lecture 4

Example of Bayes Classification

• Show a sample database.

• Show the application of Bayes theorem:
  – Use the sample database as the "set of priors".
  – Use the Bayes results to classify new data.

Page 31: Lecture 4

Example of Bayesian Classification:

• Suppose that you have a database D that contains characteristics of a large number of different kinds of cars, sorted according to each car's manufacturer = the car's classification C.

• Suppose one of the attributes X in D is the car's "color".

• Measure P(C) from the frequency of different manufacturers in D.

• Measure P(X) from the frequency of different colors among the cars in D. (This estimate is made independent of manufacturer.)

• Measure P(X|C) from the frequency of cars with color X made by manufacturer C.

• Okay, now you see a red car flying down the beltway. What is the car's make (manufacturer)? You can estimate the likelihood that the car is from a given manufacturer C by calculating P(C|X) via Bayes Theorem:
  – P(C|X) = P(X|C) P(C) / P(X) (The class is "C" when P(C|X) is a maximum.)

• With only one attribute, this is a trivial result, and not very informative. However, using a larger set of attributes (e.g., two-door, with sun roof) leads to a much better classification estimator: an example of a Bayes Belief Network.

Page 32: Lecture 4

Sample Database for Bayes Classification Example

x = car color
C = class of car (manufacturer)

Car Database:

Tuple   x       C
1       red     honda
2       blue    honda
3       white   honda
4       red     chevy
5       blue    chevy
6       white   chevy
7       red     toyota
8       white   toyota
9       white   toyota
10      red     chevy
11      white   ford
12      white   ford
13      blue    ford
14      red     chevy
15      red     dodge

Some statistical results:

x1 = red:    P(x1) = 6/15
x2 = white:  P(x2) = 6/15
x3 = blue:   P(x3) = 3/15

C1 = chevy:  P(C1) = 5/15
C2 = honda:  P(C2) = 3/15
C3 = toyota: P(C3) = 3/15
C4 = ford:   P(C4) = 3/15
C5 = dodge:  P(C5) = 1/15
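The frequencies above can be reproduced directly by counting, as in this sketch (the data are the slide's 15 tuples; the code itself is our own illustration):

    from collections import Counter

    cars = [  # (color, manufacturer) tuples 1-15 from the table above
        ("red", "honda"), ("blue", "honda"), ("white", "honda"),
        ("red", "chevy"), ("blue", "chevy"), ("white", "chevy"),
        ("red", "toyota"), ("white", "toyota"), ("white", "toyota"),
        ("red", "chevy"), ("white", "ford"), ("white", "ford"),
        ("blue", "ford"), ("red", "chevy"), ("red", "dodge"),
    ]

    n = len(cars)
    color_counts = Counter(color for color, _ in cars)   # P(x) numerators
    class_counts = Counter(maker for _, maker in cars)   # P(C) numerators
    joint_counts = Counter(cars)                         # P(x|C) numerators

    def posterior(maker, color):
        """P(C|x) = P(x|C) * P(C) / P(x), all estimated by counting."""
        p_x_given_c = joint_counts[(color, maker)] / class_counts[maker]
        return p_x_given_c * (class_counts[maker] / n) / (color_counts[color] / n)

    for maker in ("chevy", "honda", "toyota", "ford", "dodge"):
        print(f"P({maker} | red) = {posterior(maker, 'red'):.2f}")
    # 0.50, 0.17, 0.17, 0.00, 0.17 -- matching Application #1 on the next slides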

Page 33: Lecture 4

Application #1 of Bayes Theorem

• Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)

• From the last slide, we know P(C) and P(X). Calculate P(X|C) and then we can perform the classification.

Example #1: We see a red car. What type of car is it?

P(C | red) = P(red | C) * P(C) / P(red)
P(red | chevy)  = 3/5
P(red | honda)  = 1/3
P(red | toyota) = 1/3
P(red | ford)   = 0/3
P(red | dodge)  = 1/1

Therefore ...

P(chevy | red)  = 3/5 * 5/15 * 15/6 = 3/6 = 50%
P(honda | red)  = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(ford | red)   = 0
P(dodge | red)  = 1/1 * 1/15 * 15/6 = 1/6 = 17%

Page 34: Lecture 4

Results from Bayes Example #1

• Therefore, the red car is most likely a Chevy (maybe a Camaro or Corvette?).

• The red car is unlikely to be a Ford.

• We choose the most probable class as the classification of the new data item (red car): therefore, Classification = C1 (Chevy).

Page 35: Lecture 4

Application #2 of Bayes Theorem

• Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)

Example #2: We see a white car. What type of car is it?

P(C | white) = P(white | C) * P(C) / P(white)
P(white | chevy)  = 1/5
P(white | honda)  = 1/3
P(white | toyota) = 2/3
P(white | ford)   = 2/3
P(white | dodge)  = 0/1

Therefore ...

P(chevy | white)  = 1/5 * 5/15 * 15/6 = 1/6 = 17%
P(honda | white)  = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(ford | white)   = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(dodge | white)  = 0

Page 36: Lecture 4

Results from Bayes Example #2

• Therefore, the white car is equally likely to be a Ford or a Toyota.

• The white car is unlikely to be a Dodge.

• If we choose the most probable class as the classification, we have a tie. You can either pick one of the two classes at random (if you must pick), or else weight each class 0.50 in the output classification (C3, C4) if a probabilistic classification is permitted.

Page 37: Lecture 4

Intuitive Interpretation of Bayes Theorem

• Recall the theorem, for a given attribute Xk and class Cj:
  – P(Cj | Xk) = P(Xk | Cj) P(Cj) / P(Xk)

• If P(Xk | Cj) is small, then it is very unlikely that class Cj will ever produce the attribute value Xk, and so P(Cj | Xk) must likewise be small, since the class is unlikely to be Cj when we have Xk.

• Similarly, if P(Cj) is small, then P(Cj | Xk) must be small, since the class is unlikely to be Cj under any circumstance.

• Finally, if a given attribute value Xk is very rare in the training database, then P(Xk) will be very small, and therefore the value of P(Cj | Xk) will be very large [as long as the values of P(Cj) and P(Xk | Cj) are not small]. This makes sense, because if the rare value Xk is seen at all, it must imply that the class is very likely to be Cj.

Page 38: Lecture 4

Why Use Bayesian Classification?

• Probabilistic Learning: allows you to calculate explicit probabilities for a hypothesis -- "learn as you go". This is among the most practical approaches to certain types of learning problems (e.g., e-mail spam detection).

• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct.

• Data-Driven: prior knowledge can be combined with observed data.

• Probabilistic Prediction: allows you to predict multiple hypotheses, each weighted by its own probability.

• The Standard: Bayesian methods provide a standard of optimal decision-making against which other methods can be compared.

Page 39: Lecture 4

Naïve Bayesian Classification

• Naïve Bayesian Classification assumes that the attributes are independent of one another within each class C.

• Naïve Bayes assumption: attribute independence

  P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)

  (= a simple product of probabilities)

• P(xi|C) is estimated as the relative frequency of the samples in class C for which attribute "i" has the value "xi".

• This assumes that there is no correlation among the attribute values x1,…,xk (attribute independence).
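A sketch of the product rule with two attributes. Note that the second attribute ("style") and all of the tuples here are hypothetical, added only to illustrate the independence assumption:

    from collections import Counter

    # Hypothetical (color, style, manufacturer) training tuples
    data = [
        ("red", "coupe", "chevy"), ("red", "sedan", "chevy"), ("blue", "coupe", "chevy"),
        ("white", "sedan", "ford"), ("white", "sedan", "ford"), ("blue", "coupe", "ford"),
    ]
    class_counts = Counter(c for _, _, c in data)

    def p_attr_given_class(index, value, cls):
        """Relative frequency of an attribute value within class cls."""
        in_class = [t for t in data if t[2] == cls]
        return sum(t[index] == value for t in in_class) / len(in_class)

    def naive_bayes(color, style):
        scores = {}
        for cls, count in class_counts.items():
            # P(x1|C) * P(x2|C) * P(C): the attribute-independence product
            scores[cls] = (p_attr_given_class(0, color, cls) *
                           p_attr_given_class(1, style, cls) *
                           count / len(data))
        return max(scores, key=scores.get)

    print(naive_bayes("red", "coupe"))   # -> 'chevy'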

Page 40: Lecture 4

The Independence Hypothesis…

• … makes the computation possible (tractable)

• … yields optimal classifiers when satisfied

• … but is seldom satisfied in practice, as attributes (variables) are often correlated.

• Some approaches to overcome this limitation:
  – Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  – Decision trees, which reason on one attribute at a time, considering the most important attributes first

Page 41: Lecture 4


Another Bayes Classification Example: SPAM

Page 42: Lecture 4

E-mail volume explosion parallels the general data volume explosion!

http://www.computerworld.com/printthis/2003/0,4814,86632,00.html

• Quotes from the article:
  – "In 2002, people around the globe created enough new information to fill 500,000 U.S. Libraries of Congress.
  – The 5 billion GB of new data works out to about 800MB per person -- the equivalent of a stack of books 9m high.
  – In addition to looking at stored data, UC Berkeley measured electronic flows of new information at 18 billion GBytes in 2002.
  – Whether that information has any value is another question."

• Therefore, we need intelligent tools to filter the noise (spam).

Page 43: Lecture 4

Bayes Classification for Spam Detection and Removal: This E-mail Classified as Spam -- Example #1

Here are the results!

---- Start SpamAssassin results
5.60 points, 5 required;
* 0.2 -- BODY: Offers a limited time offer
* 1.9 -- BODY: Save big money
* 0.6 -- BODY: No such thing as a free lunch (1)
* 0.8 -- BODY: Stop with the offers, coupons, discounts etc!
* 0.1 -- BODY: Tells you how to stop further spam
* 0.1 -- BODY: HTML font color is red
* 0.1 -- BODY: Image tag with an ID code to identify you
* 0.1 -- BODY: HTML font color is gray
* 0.1 -- BODY: HTML included in message
* 0.5 -- BODY: Message is 50% to 60% HTML
* 0.7 -- URI: Uses a dotted-decimal IP address in URL
* 0.3 -- URI: 'remove' URL contains an email address
* 0.1 -- Headers include an "opt"ed phrase
---- End of SpamAssassin results

Page 44: Lecture 4

Bayes Classification for Spam Detection and Removal: This E-mail Classified as Spam -- Example #2

Here are the results!

---- Start SpamAssassin results
13.10 points, 5 required;
* 0.8 -- From: does not include a real name
* 1.7 -- Subject contains lots of white space
* 0.0 -- Subject talks about savings
* 2.9 -- BODY: Message seems to contain obscured email address (rot13)
* 2.1 -- BODY: Claims you registered with some kind of partner
* 0.8 -- BODY: Stop with the offers, coupons, discounts etc!
* 1.7 -- BODY: Contains "Toner Cartridge"
* 0.1 -- BODY: HTML font color is gray
* 0.2 -- BODY: HTML has unbalanced "body" tags
* 0.1 -- BODY: HTML included in message
* 1.4 -- BODY: Message is 10% to 20% HTML
* 1.3 -- Subject contains a unique ID
---- End of SpamAssassin results
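A toy sketch of this style of additive rule scoring (not SpamAssassin's real engine; the rule phrases and weights merely echo the examples above):

    RULES = [
        (0.2, lambda msg: "limited time offer" in msg.lower()),
        (1.9, lambda msg: "save big money" in msg.lower()),
        (1.7, lambda msg: "toner cartridge" in msg.lower()),
    ]

    def spam_score(message, required=5.0):
        """Sum the points of every rule that fires; spam if the threshold is met."""
        score = sum(points for points, test in RULES if test(message))
        return score, score >= required

    print(spam_score("SAVE BIG MONEY on toner cartridge refills!"))
    # -> (3.6, False): only two rules fire, below the 5-point threshold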

Page 45: Lecture 4

Classification of E-mail by another method: Support Vector Machines (SVM)

Page 46: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 47: Lecture 4

Decision Trees 101

• We discussed Decision Trees earlier -- Summary:
  – has a flow-chart-like tree structure
  – each internal node denotes a test on an attribute
  – each branch represents an outcome of a test
  – leaf nodes represent class labels or class distributions

• Using a decision tree:
  – classify an unknown sample
  – test the attribute values of the sample against the decision tree

• Issues in building decision trees:
  – How do you start?
  – When do you stop?

• Specific algorithms: ID3, C4.5, C5.0, CART

Page 48: Lecture 4

Decision Tree Terminology

[Diagram: the Root Node at the top, with branches leading down to internal nodes, and Leaf Nodes at the bottom.]

Page 49: Lecture 4

Building a Decision Tree

• Decision tree generation consists of two phases:
  – Tree construction
    • At the start, all the training examples are at the root.
    • Partition the examples top-down recursively, based on selected attributes, using a "divide and conquer" approach.
  – Tree pruning
    • Identify and remove branches that reflect noise or "uninteresting" (insignificant) subclasses.

Page 50: Lecture 4

Terminating the Tree

• Conditions for stopping the tree partitioning:
  – All samples at a given node already belong to the same class.
  – There are no remaining attributes for further partitioning -- majority voting is employed for classifying the leaf (= terminating node = the classification!).
  – There are no samples left.
  – The goodness measure drops below a preset threshold (this measure is determined according to different algorithms in different decision tree implementations).
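A recursive sketch of construction with these stopping conditions (our own illustration; a real implementation would choose the split attribute using a goodness measure such as information gain, covered below):

    from collections import Counter

    def build_tree(samples, attributes):
        """samples: list of (features_dict, label) pairs; attributes: names to split on."""
        labels = [label for _, label in samples]
        if len(set(labels)) == 1:                        # stop: all samples in one class
            return labels[0]
        if not attributes:                               # stop: no attributes remain
            return Counter(labels).most_common(1)[0][0]  # majority vote labels the leaf
        attr, rest = attributes[0], attributes[1:]       # placeholder split choice
        children = {}
        for value in {f[attr] for f, _ in samples}:      # partition: divide and conquer
            subset = [(f, l) for f, l in samples if f[attr] == value]
            children[value] = build_tree(subset, rest)
        return {attr: children}

    data = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")]
    print(build_tree(data, ["outlook"]))   # {'outlook': {'sunny': 'no', 'rain': 'yes'}}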

Page 51: Lecture 4

How to Avoid Overfitting ("fitting the noise")

• The generated tree may overfit the training data:
  – too many branches -- some may reflect anomalies due to noise or outliers
  – results in poor classification accuracy for unseen samples

• Two approaches to avoid overfitting:
  – Prepruning -- halt tree construction early: do not split a node if this would result in the "goodness measure" falling below a threshold (i.e., insufficient training examples).
    • Problem: it is difficult to choose an appropriate threshold.
  – Postpruning -- remove branches from a "fully grown" tree, to get a sequence of progressively pruned trees.
    • Use a set of data different from the training data to decide which is the "best pruned tree".

Page 52: Lecture 4

Why Use Decision Trees for Data Mining?

• can use SQL queries for accessing the databases

• can be constructed relatively quickly, compared to other methods

• relatively fast learning speed (compared to other classification methods)

• the tree is ultimately convertible to a set of simple, easy-to-understand classification rules

• due to their intuitive graphical representation, decision trees are easy for humans to assimilate

• the accuracy of decision tree classifiers is comparable or superior to other models

Page 53: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 54: Lecture 4

Decision Tree Algorithms - 1

• Two steps:
  – Decision tree induction (construct the tree from training data)
  – Application of the DT to each tuple in the database to find its class

• Refer to Definition 4.3 in the Dunham text (page 93) -- be sure to understand this.
  – Note that the branches of the tree are sometimes referred to as arcs.
  – Each arc is labeled with a predicate (an action) that is applied to the attribute associated with the parent node.
  – Each internal node is a decision point, labeled by the attribute being tested at that point.

Page 55: Lecture 4

Decision Tree Algorithms - 2

• ID3 approach:
  – "Iterative Dichotomiser" (R. Quinlan, 1979)
  – Picks predictors (nodes) and their splitting values (predicates) on the basis of an information gain metric:
    • the difference between the amount of information that is needed to make the correct prediction before and after the split has been made.
    • If the amount of information required is much lower after the split is made, then the split is said to have "decreased the disorder of the original data". This is good (i.e., the more ordered the data, the more certain is our final classification).
  – Refer to the "20 Questions" example in the textbook (page 97) -- good questions provide good DT splits: an adult asks "Is it alive?", but a child asks "Is it my daddy?".

Page 56: Lecture 4

Decision Tree Algorithms - 3

• C4.5 and C5.0 approaches -- introduce numerous improvements and extensions to ID3:
  – handle missing data in the training set
  – handle continuous data (not just categorical data)
  – improved tree pruning (subtree replacement and subtree raising, depending on acceptable error rates)
  – automated rule generation (allows for classification by the DT or by the rules alone)
  – ID3 tends to overfit ("a split for every attribute, an attribute for every split"), while C4.5 and C5.0 improve the information gain at each split.
  – C5.0 is the commercial version of C4.5, with proprietary rule-generation algorithms, targeted to large datasets.

Page 57: Lecture 4

Decision Tree Algorithms - 4

• CART approach (Classification And Regression Trees):
  – Generates a binary decision tree: only 2 children are created at each node (whereas ID3 creates a child for each subcategory).
  – Time-consuming: a search is made at each split to find the best binary split. Uses entropy from Shannon's Information Theory:
    • The measure of disorder is also called the Entropy:

      H = - Σi pi log2(pi)

    • where pi is the probability of value i occurring at a node.

Page 58: Lecture 4

Decision Tree Algorithms - 5

• CHAID approach (Chi-squared Automatic Interaction Detector):
  – distributed in the popular SAS and SPSS statistical packages
  – similar to CART, except that CHAID uses a chi-squared test instead of an entropy test to identify split points and to find the best independent variables
  – all predictors must be categorical, or put into categorical form through binning the data (i.e., no continuous data)
  – the accuracy of CART and CHAID is similar
  – CHAID is part of a large family: AID, THAID, MAID, XAID

Page 59: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 60: Lecture 4


Other Classification Methods

• K-nearest neighbor (KNN) classifier

• Case-based reasoning (CBR)

• Artificial Neural Networks (ANN)

• Genetic Algorithms (GA)

• Rough set approach

• Fuzzy set approaches

• … more about some of these next time …


Page 62: Lecture 4

Outline

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example

Page 63: Lecture 4

A Classification Example

• "Sorting incoming Fish on a conveyor according to species using optical sensing"

[Diagram: Species splits into two classes, Sea bass and Salmon.]

Page 64: Lecture 4

Problem Analysis

• Set up a camera and take some sample images to extract features. Possible features may include:
  – Length
  – Lightness
  – Width
  – Number and shape of fins
  – Position of the mouth, etc.

• This is the set of all suggested features that you will need to explore for possible use in our classifier!

Page 65: Lecture 4

• Preprocessing
  – Use a segmentation operation on the images to isolate the fish images from one another and from the background.

• Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features.

• The features are passed to a classifier.


Page 67: Lecture 4


• Classification

– Select the length of the fish as a possible feature for discrimination

Page 68: Lecture 4

[Figure: distributions of the length feature for the two species; the distributions overlap, so length alone discriminates poorly (see the next slide).]

Page 69: Lecture 4

What do we learn?

• The length of the fish is a poor classification feature by itself! (previous slide)

• Select the lightness of the fish's color as a possible classification feature. (next slide)

Page 70: Lecture 4


This attribute still does not cleanly separate the two classes

Page 71: Lecture 4

Threshold decision boundary and cost relationship

– Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are misclassified as salmon!).

This is the task of decision theory.

Page 72: Lecture 4

• Adopt the lightness and add the width of the fish, combined in a single (transformed) feature vector:

  Fish xT = [x1, x2], where x1 = lightness and x2 = width
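A sketch of a linear classifier over this two-feature vector (the weights and threshold are made-up illustrative values, not fitted ones):

    def classify(lightness, width, w=(1.0, -0.8), bias=-0.2):
        """Linear decision rule over x = [lightness, width]; the sign picks the class."""
        score = w[0] * lightness + w[1] * width + bias
        return "sea bass" if score > 0 else "salmon"

    print(classify(lightness=0.9, width=0.4))   # -> 'sea bass'
    print(classify(lightness=0.3, width=0.7))   # -> 'salmon'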

Page 73: Lecture 4


This sloped line represents a reasonable Classifier

Page 74: Lecture 4

• We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features" (i.e., we must avoid overfitting the data).

• Ideally, the best decision boundary should be the one that provides optimal performance, such as in the following figure:

Page 75: Lecture 4

This curve is a classic example of overfitting (bad!). It might be the result of applying the SVM algorithm (SVM = Support Vector Machine).

Page 76: Lecture 4

• However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input.

This is the issue of generalization! (avoiding overfitting!)

Page 77: Lecture 4


This curved dividing line is now a good classifier

Page 78: Lecture 4

What about Misclassifications? (false positives and false negatives)

• Sea Bass misclassified as Salmon
• Salmon misclassified as Sea Bass

Page 79: Lecture 4

Cost-sensitive Classification

• Penalize misclassifications of one class more than the other.

• This changes the decision boundaries.
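A sketch of the decision rule behind this (our own illustration with made-up costs): choose the class with the minimum expected cost rather than the maximum posterior probability.

    def min_cost_class(posteriors, cost):
        """posteriors: {class: P(class|x)}; cost[predicted][actual] = penalty."""
        def expected_cost(predicted):
            return sum(cost[predicted][actual] * p for actual, p in posteriors.items())
        return min(posteriors, key=expected_cost)

    posteriors = {"salmon": 0.6, "sea bass": 0.4}
    # Misclassifying sea bass as salmon is penalized 5x more than the reverse
    cost = {"salmon":   {"salmon": 0, "sea bass": 5},
            "sea bass": {"salmon": 1, "sea bass": 0}}
    print(min_cost_class(posteriors, cost))   # -> 'sea bass', despite lower probability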

Page 80: Lecture 4

What if… Salmon is more expensive than Bass? What if… Bass is more expensive than Salmon?

[Figure: the decision boundary x* shifts in opposite directions under the two cost assumptions.]

Page 81: Lecture 4


Summary

Page 82: Lecture 4

Summary of Topics Covered - Week 4

• Introduction to Classification Applications
• Issues in Classification
• Statistical Algorithms
  – Regression
  – Bayesian classification
• Decision Tree Classification
• Decision Tree Algorithms
• Other Classification Methods
• A Classification Example