
Page 1: Data Mining using the Enterprise Miner

Data Mining using the Enterprise Miner

J. Michael Hardin, Ph.D.
Professor of Statistics

Page 2: Data Mining using the Enterprise Miner

Where Are We Going?

Outline

  • What is Data Mining?
  • Overview of the Enterprise Miner
  • Transformations, Outliers, Missing Values, and Variable Selection
  • Visualization
  • Data Mining Technologies
      • Decision Trees
      • Regression Analysis
      • Neural Networks
      • Cluster Analysis
      • Association Analysis

Page 3: Data Mining using the Enterprise Miner

What is Data Mining?

Page 4: Data Mining using the Enterprise Miner

What is Data Mining?

Insights from Dilbert

Page 5: Data Mining using the Enterprise Miner

Further Insights from Dilbert

Page 6: Data Mining using the Enterprise Miner

Data Mining

Page 7: Data Mining using the Enterprise Miner

KDD Definition

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data.

Ex. From Census Bureau data: If Relationship=Husband then sex=male (prob=.996)

Fayyad, Piatetsky-Shapiro, Smyth (1996)

Page 8: Data Mining using the Enterprise Miner

What is Data Mining?

  • Data Mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns that can be exploited for business advantage.

  • A business process which uses a range of computer technologies to learn from the past, turning data into actionable knowledge.

Page 9: Data Mining using the Enterprise Miner

What is Data Mining?

  • IT: Complicated database queries

  • ML: Inductive learning from examples

  • Stat: What statisticians were taught NOT to do!

Page 10: Data Mining using the Enterprise Miner

Data Mining has emerged from a Multidisciplinary Background

  • Databases
  • Statistics
  • Pattern Recognition
  • KDD
  • Machine Learning
  • AI
  • Neurocomputing

Data Mining

Page 11: Data Mining using the Enterprise Miner

Tower of Babel

What “Bias” means in each field:

  • MACHINE LEARNING: A reason for favoring any model that does not fit the data perfectly.
  • NEUROCOMPUTING: The constant term in a linear combination.
  • STATISTICS: The expected difference between an estimator and what is being estimated.

Page 12: Data Mining using the Enterprise Miner

Reference

  • Authors: James Myers and Edward Forgy
  • Title: The Development of Numerical Credit Evaluation Systems
  • Publication: Journal of the American Statistical Association
  • Date: September,

Page 13: Data Mining using the Enterprise Miner

Nuggets

“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun.”

Herb Edelstein

Page 14: Data Mining using the Enterprise Miner

Statistics and Data Mining

Recent reflections on data mining and statistics:

  • David Hand
  • Jerome Friedman
  • Padhraic Smyth
  • Leo Breiman

Page 15: Data Mining using the Enterprise Miner

Statistics and Data Mining (cont)

Some key issues:

Data dredging, fishing, data snooping

Looking at the data, exploratory data analysis (EDA), and the scientific method

Primary vs. secondary data analysis

Large data sets, observational data, selection bias

Model selection, model uncertainty*

Page 16: Data Mining using the Enterprise Miner

Statistics and Data Mining (cont)

Some key issues:

P-values, estimation vs. prediction, classification, generalizability

Single data analysis set vs. data splitting (validation, test data sets) *

Local vs. global structure

“…classification error responds to error in …probability estimates in a much different (and perhaps less intuitive) way than squared estimation error. This helps explain why improvements to the latter do not necessarily lead to improved classification performance, and why simple methods … remain competitive, even though they usually provide poor estimates of the true probabilities” (Friedman, 1997)

Page 17: Data Mining using the Enterprise Miner

Statistics and Data Mining (cont)

Some key issues:

Two cultures in analysis of data:

  • Data modeling
      • Parameters are estimated
      • Model is validated via goodness-of-fit and residual examination

  • Algorithmic modeling
      • Construct algorithm that predicts response
      • Model validation by predictive accuracy

Breiman, L. (2001), “Statistical Modeling: The Two Cultures”, Statistical Science, (16), 199-231.

Page 18: Data Mining using the Enterprise Miner

Overview of Data Mining/KDD Process

Creating a target set of data

Data cleaning and pre-processing

Data reduction and projection

Apply Data mining techniques

Evaluation and interpretation

Refinement of earlier steps based on evaluation and interpretation

Page 19: Data Mining using the Enterprise Miner

Other Data Mining Process Names

SEMMA (SAS)

  • Sample
  • Explore
  • Modify
  • Model
  • Assess

CRISP-DM (CRoss-Industry Standard Process for Data Mining)

Page 20: Data Mining using the Enterprise Miner

Data Mining Process

Model Management

Scoring

Identify Data Requirements

Obtain Data

Validate, Explore, Clean Data

Transpose Data

Choose Best Model

Assessment Evaluate Model(s)

Train Model

Choose Modeling Technique

Create Model Set

Add Derived Variables

Page 21: Data Mining using the Enterprise Miner

Overview of the Enterprise Miner

Page 22: Data Mining using the Enterprise Miner

Enterprise Miner Interface

EM Tools Bar

Diagram Workspace

Current Project Diagram Tools

Result Summaries

Project Navigator

Page 23: Data Mining using the Enterprise Miner

Demonstration

This demonstration illustrates:

  • Creating a client-only project
  • Accessing raw modeling data
  • Transformations
  • Outliers
  • Data replacement
  • Visualizations

Page 24: Data Mining using the Enterprise Miner

Example Data Set 1 – Pima Indians Diabetes Database

National Institute of Diabetes and Digestive and Kidney Diseases

Vincent Sigillito, Johns Hopkins

Summary: The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

Page 25: Data Mining using the Enterprise Miner

Example Data Set 1 – Pima Indians Diabetes Database

Number of Cases: 768

Number of Variables: 8 plus target variable

Variables:

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1) (target variable)

Page 26: Data Mining using the Enterprise Miner

Data Mining Technologies

Page 27: Data Mining using the Enterprise Miner

Data Mining Technologies

Supervised Learning (Predictive Modeling)

Logistic Regression

Neural Networks

Decision Trees

Unsupervised Learning

Cluster Analysis

Association Analysis

Page 28: Data Mining using the Enterprise Miner

Supervised Classification

[Figure: data table with cases 1, 2, 3, 4, 5, ..., n as rows, input variables x1, x2, ..., xk as columns, and a (binary) target column y]

Page 29: Data Mining using the Enterprise Miner

Generalization

[Figure: data table of new cases (1, 2, 3, 4, 5, ..., >n) with input variables x1, x2, ..., xk; the target column is unknown]

Page 30: Data Mining using the Enterprise Miner

Mixed Measurement Scales

sales, executive, homemaker, ...

88.60, 3.92, 34890.50, 45.01, ...

0, 1, 2, 3, 4, 5, 6, ...

F, D, C, B, A

27513, 21737, 92614, 10043, ...

M, F

Page 31: Data Mining using the Enterprise Miner

Types of Targets

  • Supervised Classification
      • Event/no event (binary target)
      • Class label (multiclass problem)

  • Regression
      • Continuous outcome

  • Survival Analysis
      • Time-to-event (possibly censored)

Page 32: Data Mining using the Enterprise Miner

Modeling Methods

  • Generalized Linear Models
  • Neural Networks
  • Decision Trees

Page 33: Data Mining using the Enterprise Miner

Logistic Regression

Page 34: Data Mining using the Enterprise Miner

Functional Form

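$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

where $p_i$ is the posterior probability, each $\beta_j$ is a parameter, and each $x_{ji}$ is an input.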

Page 35: Data Mining using the Enterprise Miner

The Logit Link Function

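$$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \eta \;\Leftrightarrow\; p_i = \frac{1}{1 + e^{-\eta}}$$

[Figure: $p_i$ plotted against $\eta$ (smaller ← η → larger), approaching $p_i = 0$ for small $\eta$ and $p_i = 1$ for large $\eta$]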
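As a minimal sketch of the logit link and its inverse (Python is used here only for illustration; the function names `logit` and `inv_logit` are assumptions, not part of Enterprise Miner):

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit link: maps any real eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Larger eta pushes p toward 1; smaller eta pushes p toward 0.
for eta in (-4.0, 0.0, 4.0):
    print(eta, round(inv_logit(eta), 3))
```

Because the two functions are inverses, `logit(inv_logit(eta))` returns `eta` up to floating-point error.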

Page 36: Data Mining using the Enterprise Miner

The Fitted Surface

[Figure: two fitted surfaces over the inputs x1 and x2, one for logit(p) and one for p (ranging from 0 to 1)]

Page 37: Data Mining using the Enterprise Miner

Logistic Discrimination

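[Figure: fitted probability p (0 to 1) over the inputs x1 and x2, with the x1, x2 plane divided into regions labeled "above" and "below"]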

Page 38: Data Mining using the Enterprise Miner

Scoring New Cases

$$\operatorname{logit}(\hat{p}) = -1.6 + .14\,x_1 - .50\,x_2$$

$$\mathbf{x} = (1.1,\ 3.0)$$

$$\hat{p} = .05$$
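The scoring step can be checked with a short sketch. The coefficient magnitudes (1.6, .14, .50), the case x = (1.1, 3.0), and the result .05 are read from the slide; the signs used below are an assumption, chosen because they are the combination that reproduces the stated probability. The names `INTERCEPT`, `COEFS`, and `score` are illustrative.

```python
import math

# Coefficients as read from the slide; the signs are an assumption
# chosen to agree with the stated result p-hat = .05.
INTERCEPT = -1.6
COEFS = (0.14, -0.50)

def score(x):
    """Return the fitted probability for one case x = (x1, x2)."""
    eta = INTERCEPT + sum(b * xi for b, xi in zip(COEFS, x))
    return 1.0 / (1.0 + math.exp(-eta))

p_hat = score((1.1, 3.0))
print(round(p_hat, 2))
```

Here the linear predictor is -1.6 + 0.154 - 1.5 = -2.946, and applying the inverse logit gives a probability of about .05.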

Page 39: Data Mining using the Enterprise Miner

Demonstration

Page 40: Data Mining using the Enterprise Miner

Artificial Neural Networks

Neuron

Hidden Unit

Page 41: Data Mining using the Enterprise Miner

Multilayer Perceptron

Hidden Layers

Input Layer

Output Layer

Hidden Unit

Page 42: Data Mining using the Enterprise Miner

Activation Function

Input Layer

Page 43: Data Mining using the Enterprise Miner

Historical Background

Rosenblatt, F. (1958), “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”, Psychological Review, (65), 1958.

Page 44: Data Mining using the Enterprise Miner

Historical Background

Ackley, D.H., G.E. Hinton, and T.J. Sejnowski (1985), “A learning algorithm for Boltzmann Machines”, Cognitive Science, (9), 147-169.

Page 45: Data Mining using the Enterprise Miner

(Multiple) Linear Regression

$$E(y) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$$

[Figure: inputs $x_1$, $x_2$, $x_3$ connected directly to the output $y$ by weights $w_1$, $w_2$, $w_3$]

Page 46: Data Mining using the Enterprise Miner

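Logistic Regression

$$\ln\left(\frac{E(y)}{1 - E(y)}\right) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$$

[Figure: inputs $x_1$, $x_2$, $x_3$ connected directly to the output $y$ by weights $w_1$, $w_2$, $w_3$]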

Page 47: Data Mining using the Enterprise Miner

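Feed-Forward Neural Network

$$H_1 = g_1(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$

$$H_2 = g_2(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$

[Figure: inputs $x_1$, $x_2$, $x_3$ connected to hidden units $H_1$ and $H_2$ by weights $w_{11}, w_{12}, w_{21}, w_{22}, w_{31}, w_{32}$; the hidden units are connected to the output $y$ by weights $w_1$ and $w_2$]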

Page 48: Data Mining using the Enterprise Miner

Multilayer Perceptron

$$H_1 = \tanh(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$

$$H_2 = \tanh(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$
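The multilayer perceptron equations (two tanh hidden units feeding a linear combination) can be sketched as a forward pass. This is only an illustration: the logistic output activation is an assumption suited to a binary target, and the function name `mlp_forward` and all weight values are hypothetical.

```python
import math

def mlp_forward(x, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer perceptron with a logistic output.

    hidden_weights: one (bias, w1, w2, w3) tuple per hidden unit.
    output_weights: (w0, w1, w2), the bias followed by one weight per hidden unit.
    """
    # H_j = tanh(w_0j + w_1j*x1 + w_2j*x2 + w_3j*x3)
    hidden = [math.tanh(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
              for w in hidden_weights]
    # g0^{-1}(E(y)) = w0 + w1*H1 + w2*H2
    eta = output_weights[0] + sum(w * h for w, h in zip(output_weights[1:], hidden))
    # Logistic output activation g0 (an assumption; suits a binary target).
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical weights, chosen only for illustration.
hidden_w = [(0.1, 0.5, -0.3, 0.2), (-0.2, 0.4, 0.1, -0.6)]
output_w = (0.05, 1.2, -0.8)
print(mlp_forward((1.0, 2.0, 0.5), hidden_w, output_w))
```

Setting every weight to zero gives an output of 0.5, which is a quick sanity check on the wiring.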

Page 49: Data Mining using the Enterprise Miner

Generalized Linear Models

Input Layer

Output Layer

Page 50: Data Mining using the Enterprise Miner

Generalized Linear Model

$$g_0^{-1}(E(y)) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$$

[Figure: inputs $x_1$, $x_2$, $x_3$ connected directly to the output $y$ by weights $w_1$, $w_2$, $w_3$]

Page 51: Data Mining using the Enterprise Miner

Output Activation Function

$$g_0^{-1}(E(y)) = \mu(\mathbf{x}, \mathbf{w}) \;\Leftrightarrow\; E(y) = g_0(\mu(\mathbf{x}, \mathbf{w}))$$

Inverse output activation function = link function

Page 52: Data Mining using the Enterprise Miner

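Link Functions

Link        $g_0^{-1}(E(y))$               $E(y)$                                       Range
Identity    $E(y)$                         $\mu(\mathbf{x}, \mathbf{w})$                $(-\infty;\ +\infty)$
Logit       $\ln\frac{E(y)}{1 - E(y)}$     $\frac{1}{1 + e^{-\mu(\mathbf{x}, \mathbf{w})}}$    $(0;\ 1)$
Log         $\ln(E(y))$                    $e^{\mu(\mathbf{x}, \mathbf{w})}$            $(0;\ +\infty)$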

Page 53: Data Mining using the Enterprise Miner

Link Function Inventory

Link                 Output Act.    Scale
identity             identity       interval
log                  exponential    nonnegative
logit                logistic       binary
generalized logit    softmax        polychotomous
cumulative logit     logistic       ordinal
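A short sketch of the output activations named in the inventory, with softmax written out for the polychotomous case. The dictionary `OUTPUT_ACTIVATION` and the example inputs are illustrative assumptions; the cumulative logit is left out because it also needs ordered thresholds.

```python
import math

def softmax(etas):
    """Softmax output activation: turns one linear predictor per class into probabilities."""
    m = max(etas)  # subtract the maximum for numerical stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

# Output activations keyed by link name, following the inventory above.
OUTPUT_ACTIVATION = {
    "identity": lambda eta: eta,
    "log": math.exp,
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "generalized logit": softmax,
}

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The softmax probabilities always sum to 1 and preserve the ordering of the linear predictors.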

Page 54: Data Mining using the Enterprise Miner

Universal Approximation

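[Figure: $g_0^{-1} = w_0 + w_1 \cdot (\ldots) + w_2 \cdot (\ldots) + w_3 \cdot (\ldots) + w_4 \cdot (\ldots) + w_5 \cdot (\ldots)$, where each $(\ldots)$ is a small plot on the slide]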

Page 55: Data Mining using the Enterprise Miner

Neural Network ≠ Backpropagation

Model

Data

Fitted Model

Training

Page 56: Data Mining using the Enterprise Miner

Practical Difficulties

Troublesome Training

Model Complexity/Specification

Incomprehensibility

Unreasonable Expectations

Anthropomorphism

Noisy data

Data preparation

[Diagram: $(\mathbf{x}, y) \rightarrow \ldots \rightarrow \hat{y}$]

Page 57: Data Mining using the Enterprise Miner

“My CPU is a neural-net processor… a learning computer”

“My CPU fits regression models to data”

Page 58: Data Mining using the Enterprise Miner

Demonstration

Page 59: Data Mining using the Enterprise Miner

The Cultivation of Trees

  • Split Search: Which splits are to be considered?
  • Splitting Criterion: Which split is best?
  • Stopping Rule: When should the splitting stop?
  • Pruning Rule: Should some branches be lopped off?
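The first two questions, split search and splitting criterion, can be illustrated with a small sketch for a single numeric input and a binary target. Gini impurity is used as the criterion purely as an assumption; the slides do not say which criterion is intended, and the function names and data below are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Search every midpoint between sorted distinct values of one input.

    Returns (threshold, weighted_impurity) for the best binary split,
    or None if the input has a single distinct value.
    """
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Made-up data: the target switches from 0 to 1 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
print(best_split(x, y))
```

A full tree algorithm would repeat this search over every input at every node, then apply a stopping rule and a pruning rule, neither of which is sketched here.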

Page 60: Data Mining using the Enterprise Miner

A Field Guide to Tree Algorithms

CART

AID, THAID, CHAID

ID3, C4.5, C5.0

Page 61: Data Mining using the Enterprise Miner

…Benefits

Automatically

Detects interactions (AID)

Accommodates nonlinearity

Selects input variables

Ease of interpretation

[Figure: multivariate step function, with predicted probability (Prob) plotted over two inputs]

Page 62: Data Mining using the Enterprise Miner

Drawbacks of Trees

Roughness

Linear, Main Effects

Instability

Page 63: Data Mining using the Enterprise Miner

Demonstration

Page 64: Data Mining using the Enterprise Miner

Unsupervised Classification

Training Data

  • case 1: inputs, ?
  • case 2: inputs, ?
  • case 3: inputs, ?
  • case 4: inputs, ?
  • case 5: inputs, ?

new case

Training Data

  • case 1: inputs, cluster 1
  • case 2: inputs, cluster 3
  • case 3: inputs, cluster 2
  • case 4: inputs, cluster 1
  • case 5: inputs, cluster 2

new case

Page 65: Data Mining using the Enterprise Miner

K-means Clustering

Page 66: Data Mining using the Enterprise Miner

Final Grouping

Page 67: Data Mining using the Enterprise Miner

Areas of Applications

  • Genomics: Micro-Array
  • Nursing Home Staff Management
  • Many others

Page 68: Data Mining using the Enterprise Miner

Demonstration

Page 69: Data Mining using the Enterprise Miner

Association Rules

Transactions: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule          Support    Confidence
A ⇒ D         2/5        2/3
C ⇒ A         2/5        2/4
A ⇒ C         2/5        2/3
B & C ⇒ D     1/5        1/3
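The support and confidence figures can be reproduced from the five transactions on the slide (ABC, ACD, BCD, ADE, BCE). The sketch below is illustrative; the function names `support` and `confidence` are assumptions, and exact fractions are used so the results compare directly with the slide.

```python
from fractions import Fraction

# The five transactions listed on the slide.
TRANSACTIONS = [set("ABC"), set("ACD"), set("BCD"), set("ADE"), set("BCE")]

def support(items):
    """Fraction of transactions containing every item in `items`."""
    hits = sum(1 for t in TRANSACTIONS if items <= t)
    return Fraction(hits, len(TRANSACTIONS))

def confidence(lhs, rhs):
    """Support of the whole rule divided by the support of its left-hand side."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [("A", "D"), ("C", "A"), ("A", "C"), ("BC", "D")]:
    lhs, rhs = set(lhs), set(rhs)
    print(sorted(lhs), "=>", sorted(rhs), support(lhs | rhs), confidence(lhs, rhs))
```

Note that `Fraction` reduces to lowest terms, so the confidence of C ⇒ A is shown as 1/2 rather than 2/4.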

Page 70: Data Mining using the Enterprise Miner

Occupational Epidemiology

Identifying Risk patterns in Employment histories

Association Analysis

Employee is “basket”, events during tenure are “items”

Page 71: Data Mining using the Enterprise Miner

UAB Data Mining and Knowledge Discovery Research Group

Warren T. Jones (1), J. Michael Hardin (2, 3), Alan P. Sprague (1), Stephen E. Brossette (1), and Stephen Moser (4)

(1) Department of Computer Science
(2) Department of Health Informatics
(3) Department of Biostatistics
(4) Department of Pathology

Page 72: Data Mining using the Enterprise Miner

Data Mining Surveillance System (DMSS)

A Knowledge Discovery System for Epidemiology

Stephen E. Brossette, J. Michael Hardin, Warren T. Jones, Alan P. Sprague, and Stephen Moser

Page 73: Data Mining using the Enterprise Miner

A Strategy for Geomedical Surveillance Using the Hawkeye Knowledge Discovery System

Daisy Y. Wong (3), Warren T. Jones (3), Stephen E. Brossette (3), J. Michael Hardin (2), and Stephen A. Moser (1)

Departments of Pathology (1), Biostatistics (2), Health Informatics (2), Computer and Information Sciences (3)

University of Alabama at Birmingham

USA

Page 74: Data Mining using the Enterprise Miner

Working Interpretation

ICP

Infection Control

Data

Data Acquisition

Knowledge

Data Selection/ Preparation

Data Mining Engine

(Hawkeye)

Output

Moderator

ICC Chair

ID/MD

New Patterns

Expert Interpretation

Users in hospital

Gate keeper

Approved data for global sharing

Data from external sources

Data

Lab

A Local Site Model for Global Collaboration

Outside sharable data

Page 75: Data Mining using the Enterprise Miner

Thank You!

Questions?