Data Mining using the Enterprise Miner


J. Michael Hardin, Ph.D., Professor of Statistics

Where Are We Going?

Outline

• What is Data Mining?
• Overview of the Enterprise Miner
• Transformations, Outliers, Missing Values, and Variable Selection
• Visualization
• Data Mining Technologies
Ø Decision Trees
Ø Regression Analysis
Ø Neural Networks
Ø Cluster Analysis
Ø Association Analysis

What is Data Mining?

Insights from Dilbert

Further Insights from Dilbert

Data Mining

KDD Definition

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data

Ex. From Census Bureau data: If Relationship = Husband then Sex = Male

(prob=.996)

Fayyad, Piatetsky-Shapiro, Smyth (1996)

What is Data Mining?

• Data Mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns that can be exploited for business advantage

• A business process that uses a range of computer technologies to learn from the past, turning data into actionable knowledge

What is Data Mining?

IT: Complicated database queries

ML: Inductive learning from examples

Stat: What statisticians were taught NOT to do!

Data Mining has emerged from a Multidisciplinary Background

[Diagram: KDD/Data Mining at the intersection of Databases, Statistics, Pattern Recognition, Machine Learning/AI, and Neurocomputing]

Tower of Babel: “Bias”

MACHINE LEARNING: A reason for favoring any model that does not fit the data perfectly.

NEUROCOMPUTING: The constant term in a linear combination.

STATISTICS: The expected difference between an estimator and what is being estimated.

Reference

• Authors: James Myers and Edward Forgy
• Title: The Development of Numerical Credit Evaluation Systems
• Publication: Journal of the American Statistical Association
• Date: September

Nuggets

“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun.”

— Herb Edelstein

Statistics and Data Mining

Recent reflections on data mining and statistics:

David Hand, Jerome Friedman, Padhraic Smyth, Leo Breiman

Statistics and Data Mining (cont)

Some key issues:

Data dredging, fishing, data snooping

Looking at the data, exploratory data analysis (EDA), and the scientific method

Primary vs. secondary data analysis

Large data sets, observational data, selection bias

Model selection, model uncertainty*

Statistics and Data Mining (cont)

Some key issues:

P-values, estimation vs. prediction, classification, generalizability

Single data analysis set vs. data splitting (validation, test data sets)*

Local vs. global structure

“…classification error responds to error in …probability estimates in a much different (and perhaps less intuitive) way than squared estimation error. This helps explain why improvements to the latter do not necessarily lead to improved classification performance, and why simple methods … remain competitive, even though they usually provide poor estimates of the true probabilities.” (Friedman, 1997)
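The data-splitting practice raised above (a single analysis data set vs. separate validation and test sets) can be sketched as a simple random partition. This is a generic illustration, not code from the course; the 60/20/20 proportions are an assumption:

```python
import random

def split_data(rows, train=0.6, valid=0.2, seed=42):
    """Partition rows into training, validation, and test sets.

    The validation set guides model tuning; the test set gives an
    honest estimate of generalization error.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_set, valid_set, test_set = split_data(list(range(100)))
```

Every case lands in exactly one of the three sets, so no information leaks from model fitting into assessment.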

Statistics and Data Mining (cont)

Some key issues:

Two cultures in the analysis of data:

Data modeling
- Parameters are estimated
- Model is validated via goodness-of-fit and residual examination

Algorithmic modeling
- Construct an algorithm that predicts the response
- Model is validated by predictive accuracy

Breiman, L. (2001), “Statistical Modeling: The Two Cultures”, Statistical Science, (16), 199-231.

Overview of Data Mining/KDD Process

Creating a target set of data

Data cleaning and pre-processing

Data reduction and projection

Apply Data mining techniques

Evaluation and interpretation

Refinement of earlier steps based on evaluation and interpretation

Other Data Mining Process Names

SEMMA (SAS)

S – Sample
E – Explore
M – Modify
M – Model
A – Assess

CRISP-DM (CRoss-Industry Standard Process for Data Mining)

Data Mining Process

1. Identify Data Requirements
2. Obtain Data
3. Validate, Explore, Clean Data
4. Transpose Data
5. Add Derived Variables
6. Create Model Set
7. Choose Modeling Technique
8. Train Model
9. Evaluate Model(s) (Assessment)
10. Choose Best Model
11. Scoring
12. Model Management

Overview of the Enterprise Miner

Enterprise Miner Interface

EM Tools Bar
Diagram Workspace
Current Project
Diagram Tools
Result Summaries
Project Navigator

Demonstration

This demonstration illustrates:
- Creating a client-only project
- Accessing raw modeling data

Transformations

Outliers

Data replacement

Visualizations

Example Data Set 1 – Pima Indians Diabetes Database

National Institute of Diabetes and Digestive and Kidney Diseases

Vincent Sigillito, Johns Hopkins

Summary: The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

Number of Cases: 768

Number of Variables: 8 plus target variable

Variables:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg / (height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1) (target variable)
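For illustration, the variable list above could be mapped onto a small loader. The column names and the filename are assumptions for this sketch, not part of the original material:

```python
import csv

# Short names invented here to match the nine variables listed above.
PIMA_COLUMNS = [
    "n_pregnant", "plasma_glucose", "diastolic_bp", "triceps_skinfold",
    "serum_insulin", "bmi", "pedigree", "age",
    "diabetes",  # class variable (0 or 1): the target
]

def load_pima(path="pima-indians-diabetes.csv"):
    """Read a comma-separated Pima data file into a list of dicts,
    one per case, keyed by the column names above."""
    with open(path, newline="") as f:
        return [dict(zip(PIMA_COLUMNS, map(float, row)))
                for row in csv.reader(f) if row]
```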


Data Mining Technologies


Supervised Learning (Predictive Modeling)

Logistic Regression

Neural Networks

Decision Trees

Unsupervised Learning

Cluster Analysis

Association Analysis

Supervised Classification

[Figure: the training data as a matrix with cases 1 … n as rows, input variables x1, x2, …, xk as columns, and a (binary) target y recorded for each case]

Generalization

[Figure: new cases (beyond case n) with the same input variables x1, x2, …, xk but an unknown target value to be predicted]

Mixed Measurement Scales

sales, executive, homemaker, ...

88.60, 3.92, 34890.50, 45.01, ...

0, 1, 2, 3, 4, 5, 6, ...

F, D, C, B, A

27513, 21737, 92614, 10043, ...

M, F

Types of Targets

Supervised Classification: Event/no event (binary target); Class label (multiclass problem)

Regression: Continuous outcome

Survival Analysis: Time-to-event (possibly censored)

Modeling Methods

Generalized Linear Models

Neural Networks

Decision Trees

Logistic Regression

Functional Form

logit(p_i) = β0 + β1 x_i1 + … + βk x_ik

p_i: posterior probability; x: inputs; β: parameters

The Logit Link Function

logit(p_i) = ln( p_i / (1 − p_i) ) = η  ⇔  p_i = 1 / (1 + e^(−η))

[Figure: the logistic curve; p_i approaches 0 as η grows smaller and approaches 1 as η grows larger]
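A minimal sketch of the logit link and its inverse, mirroring the equation above:

```python
import math

def logit(p):
    """Link function: logit(p) = ln( p / (1 - p) ), mapping (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: p = 1 / (1 + e^(-eta)), mapping the real line back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))
```

The two functions undo each other, which is what lets a linear predictor η live on the whole real line while the probability stays in (0, 1).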

The Fitted Surface

[Figure: logit(p) is a plane over (x1, x2); the corresponding p is an S-shaped surface bounded between 0 and 1]

Logistic Discrimination

[Figure: thresholding the fitted surface p at a cutoff partitions the (x1, x2) plane into “above” and “below” regions separated by a line]

Scoring New Cases

A new case, e.g. x = (1.1, 3.0), is scored by plugging its inputs into the fitted equation for logit(p̂) and inverting the link, here giving p̂ = .05.
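Scoring is just the linear predictor followed by the inverse logit. The coefficients below are hypothetical (the fitted values on the original slide did not survive extraction), so the resulting probability is illustrative only:

```python
import math

def score(x, b):
    """Score a new case: eta = b0 + sum(bi * xi), then p_hat = inverse logit of eta."""
    eta = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients (intercept, b1, b2) for a two-input model.
p_hat = score((1.1, 3.0), (0.5, -1.2, -0.4))
```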

Demonstration

Artificial Neural Networks

Neuron

Hidden Unit

Multilayer Perceptron

[Diagram: input layer → hidden layer(s) of hidden units with activation functions → output layer]

Historical Background

Rosenblatt, F. (1958), “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”, Psychological Review, (65).

Historical Background

Ackley, D.H., G.E. Hinton, and T.J. Sejnowski (1985), “A Learning Algorithm for Boltzmann Machines”, Cognitive Science, (9), 147-169.

E(y) = w0 + w1 x1 + w2 x2 + w3 x3

[Diagram: inputs x1, x2, x3 connected to output y by weights w1, w2, w3]

(Multiple) Linear Regression

ln( E(y) / (1 − E(y)) ) = w0 + w1 x1 + w2 x2 + w3 x3

Logistic Regression

[Diagram: logistic regression as a network; inputs x1, x2, x3 connect directly to output y with weights w1, w2, w3]

[Diagram: a network with a hidden layer; inputs x1, x2, x3 connect to hidden units H1, H2 via weights w11 … w32, and H1, H2 connect to output y]

H1 = g1( w01 + w11 x1 + w21 x2 + w31 x3 )
H2 = g2( w02 + w12 x1 + w22 x2 + w32 x3 )
g0^(−1)( E(y) ) = w0 + w1 H1 + w2 H2

Feed-Forward Neural Network

H1 = tanh( w01 + w11 x1 + w21 x2 + w31 x3 )
H2 = tanh( w02 + w12 x1 + w22 x2 + w32 x3 )
g0^(−1)( E(y) ) = w0 + w1 H1 + w2 H2
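The multilayer perceptron equations above translate directly into code. The weights below are arbitrary illustrative values, and a logistic output activation g0 is assumed (matching a binary target):

```python
import math

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """One forward pass of a 1-hidden-layer perceptron: tanh hidden units
    H_j = tanh(b_j + sum_i w_ij * x_i), then a logistic output
    p = g0(b_out + sum_j w_j * H_j), matching the equations above."""
    H = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(W_hidden, b_hidden)]
    eta = b_out + sum(w * h for w, h in zip(w_out, H))
    return 1.0 / (1.0 + math.exp(-eta))  # g0 = logistic

# Arbitrary weights: two hidden units over three inputs.
p = mlp_forward(x=[0.5, -1.0, 2.0],
                W_hidden=[[0.1, 0.2, -0.3], [-0.4, 0.5, 0.6]],
                b_hidden=[0.0, 0.1],
                w_out=[1.0, -1.0], b_out=0.2)
```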

Multilayer Perceptron

Generalized Linear Models

[Diagram: a generalized linear model as a network with no hidden layer; the input layer connects directly to the output layer]

g0^(−1)( E(y) ) = w0 + w1 x1 + w2 x2 + w3 x3

Output Activation Function

g0^(−1)( E(y) ) = µ(x, w)  ⇔  E(y) = g0( µ(x, w) )

Inverse output activation function = link function

Link      g0^(−1)(E(y))             E(y)                     Range
Identity  E(y)                      µ(x, w)                  (−∞, +∞)
Logit     ln( E(y) / (1 − E(y)) )   1 / (1 + e^(−µ(x, w)))   (0, 1)
Log       ln( E(y) )                e^(µ(x, w))              (0, +∞)

Link Functions

Link Function Inventory

Link               Output Activation   Scale
identity           identity            interval
log                exponential         nonnegative
logit              logistic            binary
generalized logit  softmax             polychotomous
cumulative logit   logistic            ordinal
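As one example from the inventory, the softmax output activation (paired with the generalized logit link for polychotomous targets) can be sketched as:

```python
import math

def softmax(etas):
    """Softmax output activation: turn a vector of linear predictors,
    one per class, into class probabilities that are positive and sum to 1."""
    m = max(etas)                      # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

With two classes, softmax reduces to the ordinary logistic activation, which is why the binary row of the table uses the logit link.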

Universal Approximation

[Diagram: g0^(−1)(E(y)) = w0 + w1·(H1) + w2·(H2) + w3·(H3) + w4·(H4) + w5·(H5); a weighted sum of enough hidden-unit outputs can approximate a wide class of functions]

Neural Network ≠ Backpropagation

[Diagram: Model + Data → Training → Fitted Model]

Practical Difficulties

Troublesome Training

Model Complexity/Specification

Incomprehensibility

Unreasonable Expectations

Anthropomorphism

Noisy data

Data preparation

[Diagram: x → fitted model → ŷ, compared against the observed y]

“My CPU is a neural-net processor… a learning computer”

“My CPU fits regression models to data”

Demonstration

The Cultivation of Trees

Split Search: Which splits are to be considered?

Splitting Criterion: Which split is best?

Stopping Rule: When should the splitting stop?

Pruning Rule: Should some branches be lopped off?
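A minimal sketch of a split search scored by a Gini splitting criterion, one standard choice among several (the slides do not fix a particular criterion):

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Exhaustive split search on one input: try each midpoint between
    consecutive distinct sorted values, and score each candidate by the
    weighted impurity of the two child nodes. Returns (threshold, score)."""
    pairs = sorted(zip(x, y))
    xs = [px for px, _ in pairs]
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if xs[i] == xs[i - 1]:
            continue
        thr = (xs[i] + xs[i - 1]) / 2.0
        left = [py for px, py in pairs if px <= thr]
        right = [py for px, py in pairs if px > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

threshold, impurity = best_split([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

On this toy input the search finds the cut between 3 and 4, which separates the classes perfectly.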

A Field Guide to Tree Algorithms

CART

AID, THAID, CHAID

ID3, C4.5, C5.0

Benefits

Automatically:
- Detects interactions (AID)
- Accommodates nonlinearity
- Selects input variables

Ease of interpretation

[Figure: a tree produces a multivariate step function over the inputs, mapping input regions to predicted probabilities]

Drawbacks of Trees

Roughness

Linear, Main Effects

Instability

DemonstrationDemonstration

Unsupervised Classification

case 1: inputs, ?
case 2: inputs, ?
case 3: inputs, ?
case 4: inputs, ?
case 5: inputs, ?

Training Data

new case

new case

case 1: inputs, cluster 1
case 2: inputs, cluster 3
case 3: inputs, cluster 2
case 4: inputs, cluster 1
case 5: inputs, cluster 2

Training Data

K-means Clustering

Final Grouping
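A bare-bones sketch of the K-means procedure: assign, recompute, repeat. The toy data and the first-k initialization are illustrative assumptions; real implementations use smarter starts:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid (squared
    Euclidean distance), then recompute each centroid as the mean of its
    cluster; repeat. Initial centroids are simply the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

The two well-separated pairs end up in separate clusters, with centroids at their respective means: the final grouping.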

Areas of Applications

Genomics: Micro-Array

Others

Nursing Home Staff Management

Many others

Demonstration

Association Rules

Market baskets: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule        Support   Confidence
A ⇒ D       2/5       2/3
C ⇒ A       2/5       2/4
A ⇒ C       2/5       2/3
B & C ⇒ D   1/5       1/3
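The support and confidence figures in the table can be reproduced directly from the five baskets:

```python
# The five market baskets from the slide.
BASKETS = [{"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "D"},
           {"A", "D", "E"}, {"B", "C", "E"}]

def support(itemset, baskets=BASKETS):
    """Fraction of baskets containing every item in the set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets=BASKETS):
    """Confidence of the rule lhs => rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)
```

For example, A ⇒ D has support 2/5 (two baskets contain both A and D) and confidence 2/3 (A appears in three baskets, D in two of those).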

Occupational Epidemiology

Identifying risk patterns in employment histories

Association Analysis

Employee is “basket”, events during tenure are “items”

UAB Data Mining and Knowledge Discovery Research Group

Warren T. Jones1, J. Michael Hardin2,3, Alan P. Sprague1, Stephen E. Brossette1, and Stephen Moser4

1 Department of Computer Science
2 Department of Health Informatics
3 Department of Biostatistics
4 Department of Pathology

Data Mining Surveillance System (DMSS)

A Knowledge Discovery System for Epidemiology

Stephen E. Brossette, J. Michael Hardin, Warren T. Jones, Alan P. Sprague, and Stephen Moser

A Strategy for Geomedical Surveillance Using the Hawkeye Knowledge Discovery System

Daisy Y. Wong3, Warren T. Jones3, Stephen E. Brossette3, J. Michael Hardin2, and Stephen A. Moser1

Departments of Pathology1, Biostatistics2, Health Informatics2, Computer and Information Sciences3

University of Alabama at Birmingham

USA

Working Interpretation

[Diagram: DMSS workflow; infection control data is acquired, selected and prepared, and fed to the data mining engine (Hawkeye); the output (new patterns) is reviewed by a moderator, the ICC chair, and ID/MD experts, whose interpretation yields working knowledge for the ICP]

A Local Site Model for Global Collaboration

[Diagram: hospital users and the lab feed local data; a gatekeeper approves data for global sharing and admits outside sharable data from external sources]

Thank You!

Questions?
