Survey of Data Analytics Peter Bruce, President
The Institute for Statistics Education at Statistics.com
About Statistics.com:
• 100+ courses, introductory and advanced
• Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R
• All online
• Typically 4 weeks, scheduled dates
• Don’t need to be online particular times/days
• Private discussion forum with instructors - noted authors & experts
From the simple to the complex…
• 35 different reports tracking traffic daily
• Midday report: “are we on track for visitors?”
• # visitors from key domains – .gov, .mil, .senate or .house
Which photos do best?
1. Monkeys
2. Dogs
3. Cats
Analytics as Erector Set
• Most real-world analytics jobs involve building a “machine” that produces outcomes (decisions)
• Different components, some simple in function, others complex
Complex or “black-box” components
• Misunderstanding what the component does and how it works compromises the “machine”
• The task of education typically focuses on the components, particularly complex ones
• Machine-building skill comes with practice and experience in the business context (hard to teach in school)
At the center lies prediction…

A man walks into a Target® store…

[Diagram: Predictive Analytics at the center, flanked by Cluster/Segment and Affinity/Recommend]
Also part of Data Analytics:
• Outlier Detection
• Profiling
• Exploration
• Text Mining
• Social Network Analysis
Supervised learning

Statistics:
• Controlled Experiments (A-B tests)
• Observational studies
• Estimation
Predictive Models – Supervised Learning
• Lots of predictor variables; train models with known outcome (target, dependent) variables
• Use multiple methods (statistical, machine learning)
• Goes beyond the obvious, capturing complexity
• Implemented for real-time behavior and decisions
• Classification (categorical)
• Prediction (continuous)
Pregnant?
• Obvious retail clues – maternity clothes, baby food, baby clothes, crib …
• These may be too late
• Earlier clues not so obvious – lotions, supplements, and, esp., combinations and changes in purchase patterns
• Data mining algorithms can capture these less obvious, more complex signals
Training the Model
• Each row is a customer
• Numerous predictor variables (mostly on purchase data), target variable “pregnant?” (0/1)
• Those on baby shower registry are 1’s
• Women of similar demographic not on registry are 0’s
• Together they constitute the training set
• Known outcome
• Purchase data over time
K-NN (hypothetical data)

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
2        1       1       1      1      0         1         0
3        1       1       0      1      0         1         0
4        0       1       1      0      1         1         0
5        1       1       0      1      1         0         1
6        0       0       1      0      1         0         1
NEW      1       0       1      1      0         1         ?

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
NEW      1       0       1      1      0         1         ?
dif      0       1       0      0      0         0
sq dif   0       1       0      0      0         0         sum = 1

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
6        0       0       1      0      1         0         1
NEW      1       0       1      1      0         1         ?
dif      -1      0       0      -1     1         -1
sq dif   1       0       0      1      1         1         sum = 4
• Calculating distance (illustration): the NEW customer is quite close to cust #1, not so close to cust #6.
• Classification, for k=3: the three closest records (see prior slide) are 1, 2 and 3. They are all 0’s (not pregnant), so we classify the NEW customer as “0.”
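The distance calculation and k=3 classification above can be sketched in code; a minimal Python illustration using the hypothetical table (variable names are mine):

```python
# k-NN illustration using the hypothetical table above (columns zinc10 ...
# cotton90; Registry is the 0/1 class label).
customers = {
    1: [1, 1, 1, 1, 0, 1],
    2: [1, 1, 1, 1, 0, 1],
    3: [1, 1, 0, 1, 0, 1],
    4: [0, 1, 1, 0, 1, 1],
    5: [1, 1, 0, 1, 1, 0],
    6: [0, 0, 1, 0, 1, 0],
}
registry = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1}
new = [1, 0, 1, 1, 0, 1]

def sq_dist(a, b):
    """Sum of squared differences between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# distance from NEW to every training customer, nearest first
dists = sorted((sq_dist(row, new), cust) for cust, row in customers.items())

# k = 3: majority class among the three nearest neighbors
k = 3
votes = [registry[cust] for _, cust in dists[:k]]
label = max(set(votes), key=votes.count)
```

The nearest three neighbors are customers 1, 2 and 3, all non-registry, so the NEW customer is classified as 0, matching the slide.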
Statistical Distance/Similarity
• In the above procedure, we calculated a numerical measure of the distance between two records.
• There are various measures of statistical distance and similarity.
• Some are sensitive to scale, requiring normalization (standardization).
• Used in clustering and nearest-neighbor calculations.
K-Nearest Neighbors Classification
• Take a new record, find its closest neighbor (k=1)
• Assign that neighbor’s class to the new record
• Or… find the closest k records, find the majority* class, and assign that class to the new record
• High k smooths over local information (too high and useful information is lost)
• Low k fits local information (too low and you fit the noise, not the signal)
*A lower cutoff may be used when classifying rare events
Classification Algorithms, cont.
• Logistic Regression
• CART
• Discriminant Analysis
• Neural Network
• Naïve Bayes
The Overfit Problem

[Two scatterplots of Revenue vs. Expenditure over the same 0–1000 range: one with a linear regression line, one with a complex polynomial curve passing through every point]

Linear regression – fair fit (some error remains). Complex polynomial – perfect fit, but fits the noise too well, so it will have lots of error with new data. Complex models, esp. machine learning ones, are prone to this problem.
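A minimal sketch of the same phenomenon, assuming synthetic noisy linear data (not the slide’s revenue/expenditure figures): a model that memorizes the training points fits them perfectly but errs on new data, while a simple linear fit generalizes.

```python
# Overfitting sketch: synthetic data with y ≈ 1.5x + noise (assumed,
# illustrative data, not the slide's).
import random
random.seed(1)

train = [(x, 1.5 * x + random.gauss(0, 50)) for x in range(0, 1000, 100)]
test = [(x, 1.5 * x + random.gauss(0, 50)) for x in range(50, 1000, 100)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmse(pairs, predict):
    return (sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

a, b = fit_line([x for x, _ in train], [y for _, y in train])

def memorize(x):
    # the "perfect fit": return the training y whose x is nearest
    return min(train, key=lambda p: abs(p[0] - x))[1]

train_err_memo = rmse(train, memorize)          # exactly 0: pure overfit
test_err_memo = rmse(test, memorize)            # large on new data
test_err_line = rmse(test, lambda x: a + b * x) # modest on new data
```

The memorizing model’s zero training error is the trap: performance must be judged on data the model has not seen.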
Assess Model Performance
• Assess the model with new data that were not used to train the model
• Measures of performance:
• Accuracy (categorical)
• Lift (categorical/continuous)
• RMSE (continuous)
[Diagram: the data (sample) is split into a training partition and a validation partition; models 1 through n are fit on the training partition, then assessed and validated on the validation partition, and the best model is selected.]

Performance metrics:
• RMSE (continuous)
• Accuracy (categorical)
• Lift (categorical)
Measuring Accuracy: Confusion Matrix (and Cutoff Control)

Training data scoring – summary report
Cutoff probability value for success (updatable): 0.5

Classification confusion matrix:

               Predicted Class
Actual Class   1    0
1              9    3
0              11   247

The analyst sets the cutoff in the classification algorithm:
• Higher values → fewer predicted 1’s (fewer false positives, more false negatives)
• Lower values → more predicted 1’s (more false positives, fewer false negatives)

Classification accuracy = (9+247)/(9+3+11+247) = 256/270 = 0.948
But wait… classifying everyone as “0” yields accuracy of 0.956!
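The arithmetic behind both accuracy figures, as a quick check:

```python
# Accuracy from the confusion matrix above (cutoff 0.5)
tp, fn = 9, 3     # actual 1's: predicted as 1, predicted as 0
fp, tn = 11, 247  # actual 0's: predicted as 1, predicted as 0

total = tp + fn + fp + tn      # 270 records
accuracy = (tp + tn) / total   # 256/270 ≈ 0.948

# the naive rule "classify everyone as 0" is right on all 258 actual 0's
naive = (fp + tn) / total      # 258/270 ≈ 0.956
```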
Lift
• Need a metric that reflects the greater importance of the “pregnant” category, which is rare
• Lift is the model’s improvement over average (random) selection
• First step: take the predictions and rank them in order of estimated probability of belonging to the class of interest
• Next, review accuracy of predictions by decile
[Decile-wise lift chart (validation dataset): y-axis = decile mean / global mean; x-axis = deciles 1–10]

The top decile is 5.2 times more likely to be a “1” than the average record. We are using our model to “skim the cream,” and the decile chart measures how much cream the model has captured in each decile.
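The decile-wise calculation can be sketched as follows (toy scores of my own, and quintiles rather than deciles since the toy set has only 10 records):

```python
# Decile-wise lift sketch: rank records by predicted probability, then
# compare each bin's mean outcome to the global mean (toy data).
records = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
           (0.40, 0), (0.30, 0), (0.20, 0), (0.15, 0), (0.05, 0)]
# (predicted probability, actual outcome); rank highest score first
records.sort(key=lambda r: r[0], reverse=True)

global_mean = sum(y for _, y in records) / len(records)  # overall rate of 1's
n_bins = 5
size = len(records) // n_bins
# lift per bin = (mean outcome in bin) / (global mean)
lift = [sum(y for _, y in records[i * size:(i + 1) * size]) / size / global_mean
        for i in range(n_bins)]
```

Here the top bin’s lift is 1.0/0.3 ≈ 3.3: the model’s top-scored records are over three times as likely to be 1’s as the average record.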
Add test partition for unbiased estimate of performance on new data

[Diagram: the same training/validation scheme as before, with a third, test partition held out; the chosen best model is scored on the test partition for an unbiased estimate of performance on new data.]
Software
• SAS Enterprise Miner $$$$
• IBM SPSS Modeler (Clementine) $$$$
• XLMiner (Excel add-in) $
• Statistica Data Miner $$
• Salford Systems $$
• Rapid Miner $$ (open source free version)
• R open source free
K-NN
• Model-less prediction – use where features of data are highly local and without structure.
Linear classifier
• Lack of flexibility leads to more error
Source for figures: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning
KNN in XLMiner
• Partition
• Normalize
• Find best k
Predictive Modeling via classical statistics
• Linear Regression
• Logistic Regression
• Discriminant Analysis
Predictive Modeling – Machine Learning
Classification & Regression Trees (CART)
Recursively partition data for homogeneity, derive rules:
Trees, cont.
Complete partitioning leads to 100% homogeneity (100%
classification accuracy) but overfits:
Machine Learning (cont.)
Neural Networks – like regression on steroids:
Regression: CA = β1*fat + β2*salt + ε
NN: proliferation of coefficients & interactions, + iterative learning
Ensemble Methods
• Fit multiple different models
• Try additional models that are weighted averages of the predictions from multiple “single” models
• “Bagging” – fit models to bootstrap samples of cases, take the average prediction (or majority vote for classification)
• “Boosting” – iteratively fit models, each time adjusting case weights
  • Overweight the hard-to-predict cases
  • Underweight the easy-to-predict cases
  • Average the models, giving most weight to the earlier ones
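A minimal bagging sketch, assuming a toy one-feature threshold (“stump”) learner; the data and learner are illustrative, not from the slides:

```python
# Bagging: fit one model per bootstrap sample, combine by majority vote.
import random
random.seed(0)

data = [(x, x > 5) for x in range(10)]  # label is True when x > 5

def stump_fit(sample):
    """Return the threshold t minimizing errors of the rule 'predict x > t'."""
    best_err, best_t = None, None
    for t in sorted(set(x for x, _ in sample)):
        err = sum((x > t) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_err, best_t = err, t
    return best_t

# fit one stump per bootstrap sample (cases drawn with replacement)
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]
    stumps.append(stump_fit(sample))

# combine the single models by majority vote
def predict(x):
    votes = sum(x > t for t in stumps)
    return votes > len(stumps) / 2
```

Each bootstrap sample yields a slightly different threshold; averaging the votes smooths out the individual models’ quirks.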
Crowdsourcing - Kaggle
• Publish the data for which you want a model
• Let the hacker community compete
• Overfit danger – lots of energy is spent building a perfect model of a static data set. The real world is dynamic; modeling is an ongoing process.
• Usually, most of the big gain comes from the simple, the obvious
Netflix Prize – Predicting Customer Ratings
Goal: 10% improvement in RMSE of predicted customer rating
PA - Courses at Statistics.com
• Predictive Modeling
• Trees
• Data Mining in R
• Logistic Regression
Clustering-Segmentation
• Established statistical technique, used from Astronomy to Zoology
• Used in business to identify different customer segments to be targeted with different marketing approaches
Agglomerative Clustering
• Join the two closest cases together in a cluster
• Now you have many single-case clusters, plus one cluster of two cases
• Again, join the two closest clusters together (whether single-case clusters or the two-case cluster)
• As the process proceeds, you have fewer and fewer clusters
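The joining process can be sketched on toy 1-D data, using single linkage (the “minimize minimum distance” rule discussed below); the points are mine, not the slide’s table:

```python
# Agglomerative clustering sketch: repeatedly merge the two closest clusters.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]  # start with single-case clusters

def cluster_dist(a, b):
    """Single linkage: minimum pairwise distance between two clusters."""
    return min(abs(x - y) for x in a for y in b)

# stop at 3 clusters here; a full run down to one cluster traces the dendrogram
while len(clusters) > 3:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] += clusters.pop(j)  # merge the closest pair
```

The first merge joins 5.0 and 5.1 (distance 0.1), the next joins 1.0 and 1.2, and 9.0 remains a single-case cluster.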
Return to Hypothetical Purchase Data

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
2        1       1       1      1      0         1         0
3        1       1       0      1      0         1         0
4        0       1       1      0      1         1         0
5        1       1       0      1      1         0         1
6        0       0       1      0      1         0         1

[The original slide marks Steps 1 and 2 of the joining process on this table.]
Final result is a dendrogram

[Dendrogram (Ward’s method): y-axis (“Distance”) = distance between clusters; x-axis = the individual records. A sliding horizontal line across the dendrogram indicates the number of clusters at a given cut distance.]
Measuring closeness of clusters
• Minimize minimum distance between two clusters
• Minimize maximum distance
• Minimize average distance
• Minimize distance between centroids
• Minimize loss of information (ESS) that comes from joining two clusters – Ward’s Method

Different metrics can yield very different results, and even random data exhibits apparent clustering.
Recommender Systems
Association Rules / Affinity Analysis: binary transaction matrix

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90
1        1       1       1      1      0         1
2        1       1       1      1      0         1
3        1       1       0      1      0         1
4        0       1       1      0      1         1
5        1       1       0      1      1         0
6        0       0       1      0      1         0
7        1       0       1      1      0         1
What goes with what
• Apriori algorithm to generate lists of item-sets (not transactions)
• Antecedent and consequent item-sets form rules
• Support: % of transactions with a given item-set
• Confidence: % of transactions w/ antecedent that also have consequent
• Lift: (rule confidence)/(consequent support), or: in looking for the consequent item-set, how much gain do you get from the rule, as opposed to randomly picking transactions?
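The three measures can be computed directly from the binary transaction matrix above; a sketch using the rule {zinc10} → {cotton90} as an example:

```python
# Support, confidence, and lift from the binary transaction matrix above.
cols = ["zinc10", "zinc90", "mag10", "mag90", "cotton10", "cotton90"]
rows = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    idx = [cols.index(c) for c in itemset]
    return sum(all(r[i] for i in idx) for r in rows) / len(rows)

antecedent, consequent = ["zinc10"], ["cotton90"]
conf = support(antecedent + consequent) / support(antecedent)  # 4/5 = 0.8
lift = conf / support(consequent)                              # 0.8 / (5/7) = 1.12
```

A lift above 1 means the rule does better than picking transactions at random: here, knowing a customer bought zinc in the last 10 days raises the chance of a cotton purchase by 12%.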
Text analytics (very briefly)
• Origins in linguistics and computer science
• “Natural” in NLP – methods of computer language processing were extended to natural languages
• Huge challenge – ambiguity & complexity pervade natural language
• Hierarchy of recognition tasks, starting with tokenization

[Illustration: extracting the token “fox” from “The fox jumped over the log”]
Ambiguity
White space and punctuation as delimiters, but consider…
Clairson International Corp. said it expects to report a
net loss for it’s second quarter ended March 26 and
doesn’t expect to meet analysts’ profit estimates of $3.9
to $4 million, or 76 cents a share to 79 cents a share, for
its year ending Sept. 24.
Ambiguity, cont.
Or…
A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3carbazolecarbaldehyde, 9-ethyl-3-6carbazoledicarbaldehyde, ….
“Bag of words” and sentiment
Common metric: counts of positive and negative words
I adore the hero of “I hate love stories”
Simple approach will misclassify “hate”
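A minimal bag-of-words sentiment count showing this failure mode (the positive/negative word lists are illustrative, not a standard lexicon):

```python
# Bag-of-words sentiment: count positive words minus negative words.
positive = {"adore", "love"}
negative = {"hate"}

def score(text):
    # strip quotes so title words are tokenized like everything else
    words = text.lower().replace('"', ' ').split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# "hate" and "love" inside the movie title get counted as sentiment words
s = score('I adore the hero of "I hate love stories"')
```

The count comes out +1 only because two positives outvote one negative; the model has no idea that “hate” and “love” here are part of a title, not sentiment.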
Complexity counts
• Greater statistical power from more sophisticated systems
• Ben Bernanke’s Aug. 30, 2012 Jackson Hole speech – the digital version was rapidly analyzed and there was a stock sell-off
Google searches: Sparsity and Big Data
Resampling & classical statistics
• Hypothesis tests & confidence intervals
• What-if simulation (OR applications)
• Example (confidence interval): median of 100 incomes
1. Place all values in a box
2. Randomly pick one, record, replace
3. Repeat 99 more times, record median
4. Repeat steps 2+3, say, 1000 times
5. Review distribution of medians
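The five steps above as code (toy incomes standing in for the 100 observed values; the lognormal generator is just an assumed stand-in):

```python
# Bootstrap confidence interval for a median via resampling.
import random
random.seed(42)
incomes = [random.lognormvariate(3, 0.5) for _ in range(100)]  # stand-in data

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

medians = []
for _ in range(1000):
    # steps 1-3: draw 100 values with replacement, record the median
    resample = [random.choice(incomes) for _ in incomes]
    medians.append(median(resample))

# step 5: the middle 90% of resampled medians gives a confidence interval
medians.sort()
ci_90 = (medians[50], medians[949])
```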
Resampled medians

[Histogram of the 1000 resampled medians: counts on the y-axis; median values from roughly 21.65 to 29.65 on the x-axis]
Resampling Stats for Excel
Repeat & Score dialog
A-B test (of 2 web offers)
• Control: 220 views → 7 clicks = 0.0318
• Treatment: 195 views → 11 clicks = 0.0564
• 77% improvement: 0.0564/0.0318 (= 1.77)
Resampling test:
1. Box with 18 1’s and 397 0’s (total 415)
2. Shuffle & draw 220, count 1’s
3. Count 1’s in remaining 195
4. Record ratio (the 195 group to the 220 group)
5. Repeat steps 2-4, say 1000 times
6. How often >= 1.77?
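The six steps above as code (a sketch; the exact count will vary with the random seed):

```python
# Resampling (permutation) test for the A-B comparison above.
import random
random.seed(0)

observed = (11 / 195) / (7 / 220)      # observed ratio, about 1.77
box = [1] * 18 + [0] * 397             # step 1: 18 clicks among 415 views, pooled

count, trials = 0, 1000
for _ in range(trials):
    random.shuffle(box)                # step 2: shuffle the box...
    control = box[:220]                # ...and draw 220 for the control group
    treatment = box[220:]              # step 3: the remaining 195 are treatment
    c1, t1 = sum(control), sum(treatment)
    if c1 == 0:
        continue                       # ratio undefined for this shuffle; skip
    ratio = (t1 / 195) / (c1 / 220)    # step 4: treatment rate / control rate
    if ratio >= observed:              # step 6: how often >= 1.77?
        count += 1
p_value = count / trials               # step 5 done via the loop
```

A p-value near 0.14 (as on the results slide) says a 77% improvement or better arises by chance alone in roughly one shuffle in seven, so the evidence against chance is weak.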
Results (143 of 1000 trials >= 1.77)

[Histogram of the 1000 resampled ratios: counts on the y-axis]
Material presented is partially drawn from the course text, including the XLMiner User Guide.