Copyright © 2012, SAS Institute Inc. All rights reserved.
1 – DEEPER DIVE: VARIABLE SELECTION
ROUTINES IN SAS ENTERPRISE MINER
DR IAIN BROWN, ANALYTICS & INNOVATION PRACTICE, SAS UK
23 MARCH, 2017
DATA EXPLORATION AND VISUALISATION: AGENDA
• SAS – 23rd March 2017 at 10:00am
• Deeper Dive: Variable Selection Routines in SAS Enterprise Miner
• The session looks at:
- Role of SAS Enterprise Miner
- Reasons for Variable Selection / Reduction
- Supervised vs Unsupervised
- Variable Selection vs Dimension Reduction
- Overview of Variable Selection / Reduction Nodes
- Variable Clustering and Principal Components Analysis Comparison
- Demonstration
ROLE OF SAS ENTERPRISE MINER
THE ANALYTICS LIFECYCLE: PREDICTIVE ANALYTICS AND DATA MINING
• IDENTIFY / FORMULATE PROBLEM
• DATA PREPARATION
• DATA EXPLORATION
• TRANSFORM & SELECT
• BUILD MODEL
• VALIDATE MODEL
• DEPLOY MODEL
• EVALUATE / MONITOR RESULTS
SAS® ENTERPRISE MINER™
• Modern, collaborative, easy-to-use data mining
workbench
• Sophisticated set of data preparation and exploration
tools
• Modern suite of modeling techniques and methods
• Interactive model comparison, testing and validation
• Automated scoring process delivers faster results
• Open, extensible design for ultimate flexibility
SAS® ENTERPRISE MINER™
MODEL DEVELOPMENT PROCESS
• Sample, Explore, Modify, Model, Assess
• Feature Selection / Unsupervised Learning
• Feature Creation
• Supervised Learning
• Utility, Apps., Time Series, HPDM, Credit Scoring
SAS® ENTERPRISE MINER™
SEMMA IN ACTION – REPEATABLE PROCESS
REASONS FOR VARIABLE
SELECTION/REDUCTION
VARIABLE SELECTION: WHY?
• Huge amounts of data …
• A blessing or a curse?
• Problems with having too many variables:
- Correlation
- Overfitting
- Sparseness
VARIABLE SELECTION: SUPERVISED VS UNSUPERVISED
• Supervised = variable reduction methods which use the target (dependent)
variable for selection
• Unsupervised = variable reduction methods which ignore the target
(dependent) variable in the selection process
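The distinction can be made concrete with a minimal sketch. It is an illustration only and not how any SAS Enterprise Miner node is implemented: the function names, the toy data, and the 0.9 correlation cut-off are all assumptions. The supervised filter ranks inputs by their absolute correlation with the target; the unsupervised filter never looks at the target and only removes inputs that duplicate one already kept.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes neither sequence is constant (otherwise this divides by zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def supervised_select(inputs, target, k):
    """Supervised: rank inputs by |correlation with the target|, keep the top k."""
    ranked = sorted(inputs,
                    key=lambda name: abs(pearson(inputs[name], target)),
                    reverse=True)
    return ranked[:k]

def unsupervised_select(inputs, max_abs_corr=0.9):
    """Unsupervised: ignore the target and drop any input that is highly
    correlated with an input that has already been kept."""
    kept = []
    for name in inputs:
        if all(abs(pearson(inputs[name], inputs[other])) < max_abs_corr
               for other in kept):
            kept.append(name)
    return kept

# Toy data, made up for illustration.
inputs = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 12],  # exact multiple of x1, so redundant
    "x3": [5, 1, 4, 2, 6, 3],    # weakly related to everything else
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

top_supervised = supervised_select(inputs, target, k=2)
kept_unsupervised = unsupervised_select(inputs)
```

On this toy data the supervised filter keeps both x1 and x2, although they carry the same information, while the unsupervised filter keeps x1 and x3 without knowing that x3 is barely related to the target. This is one reason the two families of methods are often combined.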
VARIABLE SELECTION: OUTPUT VARIABLES
• Some variable reduction methods use the original variables as inputs into
subsequent models = Variable Selection
• Some variable reduction methods use combinations of the original variables
as inputs into subsequent models = Dimension Reduction
OVERVIEW OF VARIABLE SELECTION /
REDUCTION NODES
OVERVIEW: CONCEPTS
• Variable selection vs variable combination
• Variable Selection:
- Regression: Forward, Backward, Stepwise selection
- Tree based
- Variable Selection Node
- Interactive Grouping Node (IGN)
- LARS (LASSO): the LASSO method adds and deletes parameters based on a version of ordinary least squares in which the sum of the absolute regression coefficients is constrained.
• Variable Clustering:
- grouping correlated subsets of original variables
- selecting a representative "best" variable from each cluster, with minimal resulting collinearity
• Principal Components:
- uncorrelated linear combinations of all input variables
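For reference, the constrained least-squares problem behind the LASSO description above can be written as follows. The notation is an assumption introduced here, not taken from the slides: $y_i$ is the target, $x_{ij}$ the value of input $j$ for observation $i$, $\beta_j$ the regression coefficients, and $t \ge 0$ the constraint bound.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
\quad \text{subject to} \quad
\sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

As $t$ shrinks, some coefficients are driven exactly to zero, which is why the method acts as a variable selection routine.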
CONCEPTS
• Input variables and TARGET: VAR01 to VAR10, TARGET
• Variable Selection / IGN based on relationship with TARGET: VAR01, VAR02, VAR04, VAR07, VAR09, TARGET
• Cluster Scores based on Variable Clustering: VAR01 to VAR10 combined into CLUS1, CLUS2, CLUS3
• Best Variables based on Variable Clustering: VAR02, VAR05, VAR09
• Principal Components: VAR01 to VAR10 combined into PC01 … PC10
VARIABLE SELECTION ROUTINES
• Variable Selection Node
- Relationship of independent variables to dependent target
- R-Square or Chi-Square selection criteria
• Interactive Grouping Node
- Computes Weights of Evidence
- Gini and Information Values for variable selection
• Variable Clustering Node
- Identifies correlations and covariances between input variables
- Selects the best variable from each cluster, or the cluster component
• Principal Components Node
- Calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables
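As a rough illustration of the Weights of Evidence and Information Value statistics mentioned for the Interactive Grouping Node, the sketch below computes both for a binned input. It assumes the common convention WoE = ln(share of non-events / share of events) per bin; the function name and the counts are hypothetical, and the node's own grouping logic is not reproduced here.

```python
import math

def woe_iv(groups):
    """groups: list of (non_event_count, event_count), one pair per bin.
    Returns (WoE per bin, total Information Value).
    Assumes every bin has at least one event and one non-event;
    a zero count would need smoothing before taking the logarithm."""
    total_non = sum(non for non, _ in groups)
    total_evt = sum(evt for _, evt in groups)
    woes, iv = [], 0.0
    for non, evt in groups:
        p_non = non / total_non  # share of all non-events falling in this bin
        p_evt = evt / total_evt  # share of all events falling in this bin
        woe = math.log(p_non / p_evt)
        woes.append(woe)
        iv += (p_non - p_evt) * woe
    return woes, iv

# Hypothetical two-bin input: bin 1 is mostly non-events, bin 2 mostly events.
woes, iv = woe_iv([(80, 20), (20, 80)])

# An input whose bins all have the same event rate carries no information.
flat_woes, flat_iv = woe_iv([(100, 50), (100, 50)])
```

An input with a higher Information Value separates events from non-events more strongly, so ranking inputs by this value gives a supervised selection criterion.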
VARIABLE CLUSTERING AND PRINCIPAL
COMPONENTS ANALYSIS COMPARISON
VARIABLE CLUSTERING: MAIN FEATURES
• The Variable Clustering node divides the input variables into hierarchical clusters.
• The main idea is to select one variable (or the cluster component) from each cluster
as a cluster representative.
• The representative variables (or components) are used as input variables in
successor nodes.
• The other input variables are rejected.
[Diagram, built up over several slides: input variables X1 to X10]
• Inputs selected by: cluster representation, expert opinion, target correlation
• Selected inputs shown: X1, X4, X6, X8, X9, X10
• Selected inputs shown on the final slide: X1, X3, X4, X6, X8, X9, X10
VARIABLE CLUSTERING: WHAT IS A CLUSTER COMPONENT?
• Each cluster can be described as a linear combination of the variables in the
cluster.
• This is the first principal component of the cluster.
• In this context, it is called the cluster component.
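A minimal sketch of this idea, assuming the cluster component is taken from the correlation matrix of the cluster's variables. It uses plain power iteration rather than whatever routine the node uses internally, and the data and function names are made up for illustration.

```python
import math

def standardize(col):
    """Centre a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def correlation_matrix(columns):
    """Correlation matrix of a list of equal-length numeric columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]

def first_component(corr, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit-length loadings).
    The uneven start vector lowers the risk of starting orthogonal to the
    leading eigenvector, but does not rule it out in general."""
    p = len(corr)
    w = [1.0 + i for i in range(p)]
    for _ in range(iterations):
        v = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    eigenvalue = sum(w[i] * corr[i][j] * w[j]
                     for i in range(p) for j in range(p))
    return eigenvalue, w

# One hypothetical cluster of three correlated variables.
cluster = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [1, 3, 2, 5, 4],
]
eigenvalue, loadings = first_component(correlation_matrix(cluster))
share_explained = eigenvalue / len(cluster)

# Cluster component score per observation: weighted sum of standardized inputs.
z = [standardize(c) for c in cluster]
component_scores = [sum(w * col[i] for w, col in zip(loadings, z))
                    for i in range(len(z[0]))]
```

Here the single component accounts for roughly 91% of the variation of the three variables, which is the sense in which one cluster component (or one representative variable) can stand in for the whole cluster.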
VARIABLE CLUSTERING: ALGORITHM
• The algorithm is divisive; at the start, all variables are in one single cluster.
• The following steps are repeated until convergence:
1. A cluster is chosen for splitting.
2. The chosen cluster is split into two clusters.
3. The variables are iteratively (re)assigned to the clusters.
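The three steps can be sketched in a heavily simplified form. Everything specific below is an assumption made for illustration and is not the node's actual procedure: the cluster centroid (mean of the standardized members) stands in for the cluster component, the cluster to split is the one with the lowest average squared correlation to its centroid, the split is seeded with the least-correlated pair of variables, and splitting stops at a hypothetical homogeneity threshold of 0.8.

```python
import math

def standardize(col):
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def r2(x, y):
    """Squared Pearson correlation (assumes neither column is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def centroid(z, names):
    """Stand-in for the cluster component: mean of the standardized members."""
    n = len(z[names[0]])
    return [sum(z[v][i] for v in names) / len(names) for i in range(n)]

def homogeneity(z, names):
    """Average squared correlation between members and their centroid."""
    c = centroid(z, names)
    return sum(r2(z[v], c) for v in names) / len(names)

def split(z, names):
    # Step 2: seed two new clusters with the least-correlated pair.
    pairs = [(r2(z[a], z[b]), a, b)
             for i, a in enumerate(names) for b in names[i + 1:]]
    _, seed_a, seed_b = min(pairs)
    left, right = [seed_a], [seed_b]
    for v in names:
        if v not in (seed_a, seed_b):
            closer_to_a = r2(z[v], z[seed_a]) >= r2(z[v], z[seed_b])
            (left if closer_to_a else right).append(v)
    # Step 3: reassign variables until memberships stop changing.
    for _ in range(20):
        cl, cr = centroid(z, left), centroid(z, right)
        new_left = [v for v in names if r2(z[v], cl) >= r2(z[v], cr)]
        new_right = [v for v in names if v not in new_left]
        if not new_left or not new_right or set(new_left) == set(left):
            break
        left, right = new_left, new_right
    return left, right

def cluster_variables(data, min_homogeneity=0.8):
    z = {name: standardize(col) for name, col in data.items()}
    clusters = [list(data)]  # divisive: start with one single cluster
    while True:
        # Step 1: choose the least homogeneous cluster that can be split.
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        worst = min(candidates, key=lambda c: homogeneity(z, c))
        if homogeneity(z, worst) >= min_homogeneity:
            break
        left, right = split(z, worst)
        clusters.remove(worst)
        clusters += [left, right]
    return clusters

# Toy data: a and b move together, c and d move together.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],
    "c": [2, 6, 1, 5, 3, 4],
    "d": [2.1, 5.8, 1.2, 5.1, 2.9, 4.2],
}
clusters = cluster_variables(data)
```

On the toy data the single starting cluster is split once into {a, b} and {c, d}; both resulting clusters are homogeneous enough that no further split is made.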
VARIABLE CLUSTERING: LARGE DATASETS
• Computationally efficient if the data set has fewer than 100 variables and
fewer than 100,000 observations.
• If you have more than 100 variables:
• Use two-stage variable clustering.
• If you have more than 100,000 observations:
• Sample the data.
VARIABLE CLUSTERING: METHODS FOR REDUCING PROCESSING TIME
• If the data set has more than 30 variables:
• If the number of clusters is known, specify the number of clusters.
• Set the Keep Hierarchies property to Yes.
• Set the Two Stage Clustering property to Yes.
VARIABLE CLUSTERING: TWO-STAGE
• This four-step approach is used to speed up variable clustering with more
than 100 input variables.
• Global clusters are formed and variable clustering is performed on each
global cluster.
VARIABLE CLUSTERING: PROS
• Reduction of collinearity
• Redundancy reduction with low information loss
• Identification of underlying data structure
• Interpretation of original input variables can be kept in successor nodes.
VARIABLE CLUSTERING: CONS
• One-stage clustering is not computationally efficient with more than 100 input variables.
• Node cannot be used on data with more than 100,000 observations.
• Method is not so well-known. You need to explain it.
• Levels of categorical variables can be located in different clusters.
PRINCIPAL COMPONENTS ANALYSIS
• Principal components are constructed as mathematical transformations of the
input variables.
• The first principal component is constructed in such a way that it captures as
much of the variation in the set of input variables (the X-space) as possible.
• The second principal component is orthogonal to the first principal
component.
• The second principal component captures as much as possible of the
variation in the input data not captured by the first principal component.
• And so on ...
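The construction described above has a standard formal statement. The symbols are assumptions introduced here: $x$ is the vector of centred input variables, $\Sigma$ its covariance (or correlation) matrix, and $w_k$ the unit-length weight vector of the $k$-th principal component.

```latex
w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}\bigl(w^{\top} x\bigr),
\qquad
w_k = \arg\max_{\substack{\lVert w \rVert = 1 \\ w \perp w_1, \dots, w_{k-1}}}
      \operatorname{Var}\bigl(w^{\top} x\bigr)

\Sigma\, w_k = \lambda_k\, w_k,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge 0,
\qquad
pc_k = w_k^{\top} x,
\qquad
\operatorname{Var}(pc_k) = \lambda_k
```

In other words, the weight vectors are the eigenvectors of $\Sigma$ ordered by decreasing eigenvalue, and each eigenvalue is the amount of variation captured by its component.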
PCA: INPUT AND OUTPUT VARIABLES
• Input variables: $x_1, x_2, x_3$
• Principal component 1: $pc_1 = a_1 x_1 + b_1 x_2 + c_1 x_3$
• Principal component 2: $pc_2 = a_2 x_1 + b_2 x_2 + c_2 x_3$
• Principal component 3: $pc_3 = a_3 x_1 + b_3 x_2 + c_3 x_3$
PCA: SELECTION OF THE NUMBER OF PCs
• The number of principal components used as input variables for the
successor modelling nodes can be selected using one of the following:
• Proportion of variance explained
• Scree plot
• Eigenvalue > 1
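Two of these criteria are simple enough to sketch directly; the scree plot is a visual judgement and is not coded here. The function names, the 85% threshold, and the eigenvalues are hypothetical and only illustrate the rules.

```python
def n_by_variance(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k
    return len(ordered)

def n_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Number of components whose eigenvalue exceeds the cutoff. The
    'eigenvalue > 1' rule is usually applied to a correlation matrix,
    where each original input contributes a variance of exactly 1."""
    return sum(1 for ev in eigenvalues if ev > cutoff)

# Hypothetical eigenvalues of a six-variable problem.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]

keep_by_variance = n_by_variance(eigenvalues)
keep_by_eigenvalue = n_by_eigenvalue(eigenvalues)
```

With these numbers the variance rule keeps four components (cumulative shares 40%, 65%, 80%, 90%) and the eigenvalue rule keeps three, which shows that the criteria need not agree.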
PCA: PROS
• Constructed output variables are uncorrelated by construction.
• The selection order of the principal components is automatically determined.
• The principal components are constructed in such a way that the first principal component represents more of the variation in the data cloud than the second one, and so on.
• Often, only a very small number of principal components need to be kept in order to explain a lot of the variation in the data cloud.
PCA: CONS
• It is difficult or impossible to interpret the constructed principal components.
• It is difficult to know how many principal components should be selected as
new input variables.
• All original input variables are still used because they build the principal
components.
• Misinterpretation of the coefficients of the linear combinations is common.
SAS® ENTERPRISE MINER™: DEMONSTRATION
• Variable selection and reduction techniques
VARIABLE SELECTION: SUMMARY
• Comprehensive variable selection and dimension reduction toolset
• Number of approaches to data and dimension reduction
• Importance of enhancing data prior to model development
• Leads to:
• Better model stability
• Longer model life-span
• Reduced complexity
www.SAS.com
QUESTIONS AND ANSWERS