Survey of Data Analytics Peter Bruce, President
The Institute for Statistics Education at Statistics.com
About Statistics.com:
• 100+ courses, introductory and advanced
• Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R
• All online
• Typically 4 weeks, scheduled dates
• Don’t need to be online particular times/days
• Private discussion forum with instructors - noted authors & experts
From the simple to the complex…
• 35 different reports tracking traffic daily
• Midday report: “are we on track for visitors?”
• # visitors from key domains – .gov, .mil, .senate or .house
Which photos do best?
1. Monkeys
2. Dogs
3. Cats
Analytics as Erector Set
• Most real-world analytics jobs involve building a “machine” that produces outcomes (decisions)
• Different components, some simple in function, others complex
Complex or “black-box” components
• Misunderstanding what the component does and how it works compromises the “machine”
• The task of education typically focuses on the components, particularly complex ones
• Machine-building skill comes with practice and experience in the business context (hard to teach in school)
At the center lies prediction…

A man walks into a Target® store…

[Diagram: Predictive Analytics at the center, flanked by Cluster/Segment and Affinity/Recommend]
Also part of Data Analytics:
• Outlier Detection
• Profiling
• Exploration
• Text Mining
• Social Network Analysis
Supervised learning

Statistics:
• Controlled Experiments (A-B tests)
• Observational studies
• Estimation
Predictive Models – Supervised Learning
• Lots of predictor variables; train models with known outcome (target, dependent) variables
• Use multiple methods (statistical, machine learning)
• Goes beyond the obvious, capturing complexity
• Implemented for real-time behavior and decisions
• Classification (categorical)
• Prediction (continuous)
Pregnant?
• Obvious retail clues – maternity clothes, baby food, baby clothes, crib …
• These may be too late
• Earlier clues not so obvious – lotions, supplements, and, esp., combinations and changes in purchase patterns
• Data mining algorithms can capture these less obvious, more complex signals
Training the Model
• Each row is a customer
• Numerous predictor variables (mostly on purchase data), target variable “pregnant?” (0/1)
• Those on baby shower registry are 1’s
• Women of similar demographic not on registry are 0’s
• Together they constitute the training set
• Known outcome
• Purchase data over time
K-NN (hypothetical data)

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
2        1       1       1      1      0         1         0
3        1       1       0      1      0         1         0
4        0       1       1      0      1         1         0
5        1       1       0      1      1         0         1
6        0       0       1      0      1         0         1
NEW      1       0       1      1      0         1         ?

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
NEW      1       0       1      1      0         1         ?
dif      0       1       0      0      0         0
sq dif   0       1       0      0      0         0         sum = 1

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
6        0       0       1      0      1         0         1
NEW      1       0       1      1      0         1         ?
dif      -1      0       0      -1     1         -1
sq dif   1       0       0      1      1         1         sum = 4
• Calculating distance (illustration): the NEW customer is quite close to cust #1, not so close to cust #6.
• Classification, for k=3: the three closest records (see prior slide) are 1, 2 and 3. They are all 0’s (not pregnant), so we classify the NEW customer as “0.”
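The distance calculation and k=3 classification above can be sketched in code; a minimal Python illustration using the hypothetical table (variable names are mine):

```python
# k-NN illustration using the hypothetical table above (columns zinc10 ...
# cotton90; Registry is the 0/1 class label).
customers = {
    1: [1, 1, 1, 1, 0, 1],
    2: [1, 1, 1, 1, 0, 1],
    3: [1, 1, 0, 1, 0, 1],
    4: [0, 1, 1, 0, 1, 1],
    5: [1, 1, 0, 1, 1, 0],
    6: [0, 0, 1, 0, 1, 0],
}
registry = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1}
new = [1, 0, 1, 1, 0, 1]

def sq_dist(a, b):
    """Sum of squared differences between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# distance from NEW to every training customer, nearest first
dists = sorted((sq_dist(row, new), cust) for cust, row in customers.items())

# k = 3: majority class among the three nearest neighbors
k = 3
votes = [registry[cust] for _, cust in dists[:k]]
label = max(set(votes), key=votes.count)
```

The nearest three neighbors are customers 1, 2 and 3, all non-registry, so the NEW customer is classified as 0, matching the slide.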
Statistical Distance/Similarity
• In the above procedure, we calculated a numerical measure of the distance between two records.
• There are various measures of statistical distance and similarity.
• Some are sensitive to scale, requiring normalization (standardization).
• Used in clustering and nearest-neighbor calculations.
K-Nearest Neighbors Classification
• Take a new record, find its closest neighbor (k=1)
• Assign that neighbor’s class to the new record
• Or… find the closest k records, find the majority* class, and assign that class to the new record
• High k smooths over local information (too high and useful information is lost)
• Low k fits local information (too low and you fit the noise, not the signal)
*A lower cutoff may be used when classifying rare events
Classification Algorithms, cont.
• Logistic Regression
• CART
• Discriminant Analysis
• Neural Network
• Naïve Bayes
The Overfit Problem

[Two scatterplots of Revenue vs. Expenditure over the same 0–1000 range: one with a linear regression line, one with a complex polynomial curve passing through every point]

Linear regression – fair fit (some error remains). Complex polynomial – perfect fit, but fits the noise too well, so it will have lots of error with new data. Complex models, esp. machine learning ones, are prone to this problem.
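A minimal sketch of the same phenomenon, assuming synthetic noisy linear data (not the slide’s revenue/expenditure figures): a model that memorizes the training points fits them perfectly but errs on new data, while a simple linear fit generalizes.

```python
# Overfitting sketch: synthetic data with y ≈ 1.5x + noise (assumed,
# illustrative data, not the slide's).
import random
random.seed(1)

train = [(x, 1.5 * x + random.gauss(0, 50)) for x in range(0, 1000, 100)]
test = [(x, 1.5 * x + random.gauss(0, 50)) for x in range(50, 1000, 100)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmse(pairs, predict):
    return (sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

a, b = fit_line([x for x, _ in train], [y for _, y in train])

def memorize(x):
    # the "perfect fit": return the training y whose x is nearest
    return min(train, key=lambda p: abs(p[0] - x))[1]

train_err_memo = rmse(train, memorize)          # exactly 0: pure overfit
test_err_memo = rmse(test, memorize)            # large on new data
test_err_line = rmse(test, lambda x: a + b * x) # modest on new data
```

The memorizing model’s zero training error is the trap: performance must be judged on data the model has not seen.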
Assess Model Performance
• Assess the model with new data that were not used to train the model
• Measures of performance:
• Accuracy (categorical)
• Lift (categorical/continuous)
• RMSE (continuous)
[Diagram: the data (sample) is split into a training partition and a validation partition; models 1 through n are fit on the training partition, then assessed and validated on the validation partition, and the best model is selected.]

Performance metrics:
• RMSE (continuous)
• Accuracy (categorical)
• Lift (categorical)
Measuring Accuracy: Confusion Matrix (and Cutoff Control)

Training data scoring – summary report
Cutoff probability value for success (updatable): 0.5

Classification confusion matrix:

               Predicted Class
Actual Class   1    0
1              9    3
0              11   247

The analyst sets the cutoff in the classification algorithm:
• Higher values → fewer predicted 1’s (fewer false positives, more false negatives)
• Lower values → more predicted 1’s (more false positives, fewer false negatives)

Classification accuracy = (9+247)/(9+3+11+247) = 256/270 = 0.948
But wait… classifying everyone as “0” yields accuracy of 0.956!
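The arithmetic behind both accuracy figures, as a quick check:

```python
# Accuracy from the confusion matrix above (cutoff 0.5)
tp, fn = 9, 3     # actual 1's: predicted as 1, predicted as 0
fp, tn = 11, 247  # actual 0's: predicted as 1, predicted as 0

total = tp + fn + fp + tn      # 270 records
accuracy = (tp + tn) / total   # 256/270 ≈ 0.948

# the naive rule "classify everyone as 0" is right on all 258 actual 0's
naive = (fp + tn) / total      # 258/270 ≈ 0.956
```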
Lift
• Need a metric that reflects the greater importance of the “pregnant” category, which is rare
• Lift is the model’s improvement over average (random) selection
• First step: take the predictions and rank them in order of estimated probability of belonging to the class of interest
• Next, review accuracy of predictions by decile
[Decile-wise lift chart (validation dataset): y-axis = decile mean / global mean; x-axis = deciles 1–10]

The top decile is 5.2 times more likely to be a “1” than the average record. We are using our model to “skim the cream,” and the decile chart measures how much cream the model has captured in each decile.
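The decile-wise calculation can be sketched as follows (toy scores of my own, and quintiles rather than deciles since the toy set has only 10 records):

```python
# Decile-wise lift sketch: rank records by predicted probability, then
# compare each bin's mean outcome to the global mean (toy data).
records = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
           (0.40, 0), (0.30, 0), (0.20, 0), (0.15, 0), (0.05, 0)]
# (predicted probability, actual outcome); rank highest score first
records.sort(key=lambda r: r[0], reverse=True)

global_mean = sum(y for _, y in records) / len(records)  # overall rate of 1's
n_bins = 5
size = len(records) // n_bins
# lift per bin = (mean outcome in bin) / (global mean)
lift = [sum(y for _, y in records[i * size:(i + 1) * size]) / size / global_mean
        for i in range(n_bins)]
```

Here the top bin’s lift is 1.0/0.3 ≈ 3.3: the model’s top-scored records are over three times as likely to be 1’s as the average record.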
Add test partition for unbiased estimate of performance on new data

[Diagram: the same training/validation scheme as before, with a third, test partition held out; the chosen best model is scored on the test partition for an unbiased estimate of performance on new data.]
Software
• SAS Enterprise Miner $$$$
• IBM SPSS Modeler (Clementine) $$$$
• XLMiner (Excel add-in) $
• Statistica Data Miner $$
• Salford Systems $$
• Rapid Miner $$ (open source free version)
• R open source free
K-NN
• Model-less prediction – use where features of data are highly local and without structure.
Linear classifier
• Lack of flexibility leads to more error
Source for figures: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning
KNN in XLMiner
• Partition
• Normalize
• Find best k
Predictive Modeling via classical statistics
• Linear Regression
• Logistic Regression
• Discriminant Analysis
Predictive Modeling – Machine Learning
Classification & Regression Trees (CART)
Recursively partition data for homogeneity, derive rules:
Trees, cont.
Complete partitioning leads to 100% homogeneity (100%
classification accuracy) but overfits:
Machine Learning (cont.)
Neural Networks – like regression on steroids:
Regression: CA = β1*fat + β2*salt + ε
NN: proliferation of coefficients & interactions, + iterative learning
Ensemble Methods
• Fit multiple different models
• Try additional models that are weighted averages of the predictions from multiple “single” models
• “Bagging” – fit models to bootstrap samples of cases, take the average prediction (or majority vote for classification)
• “Boosting” – iteratively fit models, each time adjusting case weights
  • Overweight the hard-to-predict cases
  • Underweight the easy-to-predict cases
  • Average the models, giving most weight to the earlier ones
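A minimal bagging sketch, assuming a toy one-feature threshold (“stump”) learner; the data and learner are illustrative, not from the slides:

```python
# Bagging: fit one model per bootstrap sample, combine by majority vote.
import random
random.seed(0)

data = [(x, x > 5) for x in range(10)]  # label is True when x > 5

def stump_fit(sample):
    """Return the threshold t minimizing errors of the rule 'predict x > t'."""
    best_err, best_t = None, None
    for t in sorted(set(x for x, _ in sample)):
        err = sum((x > t) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_err, best_t = err, t
    return best_t

# fit one stump per bootstrap sample (cases drawn with replacement)
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]
    stumps.append(stump_fit(sample))

# combine the single models by majority vote
def predict(x):
    votes = sum(x > t for t in stumps)
    return votes > len(stumps) / 2
```

Each bootstrap sample yields a slightly different threshold; averaging the votes smooths out the individual models’ quirks.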
Crowdsourcing - Kaggle
• Publish the data for which you want a model
• Let the hacker community compete
• Overfit danger – lots of energy is spent building a perfect model of a static data set. The real world is dynamic; modeling is an ongoing process.
• Usually, most of the big gain comes from the simple, the obvious
Netflix Prize – Predicting Customer Ratings
Goal: 10% improvement in RMSE of predicted customer rating
PA - Courses at Statistics.com
• Predictive Modeling
• Trees
• Data Mining in R
• Logistic Regression
Clustering-Segmentation
• Established statistical technique, used from Astronomy to Zoology
• Used in business to identify different customer segments to be targeted with different marketing approaches
Agglomerative Clustering
• Join the two closest cases together in a cluster
• Now you have many single-case clusters, plus one cluster of two cases
• Again, join the two closest clusters together (whether single-case clusters or the two-case cluster)
• As the process proceeds, you have fewer and fewer clusters
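The joining process can be sketched on toy 1-D data, using single linkage (the “minimize minimum distance” rule discussed below); the points are mine, not the slide’s table:

```python
# Agglomerative clustering sketch: repeatedly merge the two closest clusters.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]  # start with single-case clusters

def cluster_dist(a, b):
    """Single linkage: minimum pairwise distance between two clusters."""
    return min(abs(x - y) for x in a for y in b)

# stop at 3 clusters here; a full run down to one cluster traces the dendrogram
while len(clusters) > 3:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] += clusters.pop(j)  # merge the closest pair
```

The first merge joins 5.0 and 5.1 (distance 0.1), the next joins 1.0 and 1.2, and 9.0 remains a single-case cluster.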
Return to Hypothetical Purchase Data

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1        1       1       1      1      0         1         0
2        1       1       1      1      0         1         0
3        1       1       0      1      0         1         0
4        0       1       1      0      1         1         0
5        1       1       0      1      1         0         1
6        0       0       1      0      1         0         1

[The original slide marks Steps 1 and 2 of the joining process on this table.]
Final result is a dendrogram

[Dendrogram (Ward’s method): y-axis (“Distance”) = distance between clusters; x-axis = the individual records. A sliding horizontal line across the dendrogram indicates the number of clusters at a given cut distance.]
Measuring closeness of clusters
• Minimize minimum distance between two clusters
• Minimize maximum distance
• Minimize average distance
• Minimize distance between centroids
• Minimize loss of information (ESS) that comes from joining two clusters – Ward’s Method

Different metrics can yield very different results, and even random data exhibits apparent clustering.
Recommender Systems
Association Rules / Affinity Analysis: binary transaction matrix

Cust #   zinc10  zinc90  mag10  mag90  cotton10  cotton90
1        1       1       1      1      0         1
2        1       1       1      1      0         1
3        1       1       0      1      0         1
4        0       1       1      0      1         1
5        1       1       0      1      1         0
6        0       0       1      0      1         0
7        1       0       1      1      0         1
What goes with what
• Apriori algorithm to generate lists of item-sets (not transactions)
• Antecedent and consequent item-sets form rules
• Support: % of transactions with a given item-set
• Confidence: % of transactions w/ antecedent that also have consequent
• Lift: (rule confidence)/(consequent support), or: in looking for the consequent item-set, how much gain do you get from the rule, as opposed to randomly picking transactions?
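The three measures can be computed directly from the binary transaction matrix above; a sketch using the rule {zinc10} → {cotton90} as an example:

```python
# Support, confidence, and lift from the binary transaction matrix above.
cols = ["zinc10", "zinc90", "mag10", "mag90", "cotton10", "cotton90"]
rows = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    idx = [cols.index(c) for c in itemset]
    return sum(all(r[i] for i in idx) for r in rows) / len(rows)

antecedent, consequent = ["zinc10"], ["cotton90"]
conf = support(antecedent + consequent) / support(antecedent)  # 4/5 = 0.8
lift = conf / support(consequent)                              # 0.8 / (5/7) = 1.12
```

A lift above 1 means the rule does better than picking transactions at random: here, knowing a customer bought zinc in the last 10 days raises the chance of a cotton purchase by 12%.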
Text analytics (very briefly)
• Origins in linguistics and computer science
• “Natural” in NLP – methods of computer language processing were extended to natural languages
• Huge challenge – ambiguity & complexity pervade natural language
• Hierarchy of recognition tasks, starting with tokenization

[Illustration: extracting the token “fox” from “The fox jumped over the log”]
Ambiguity
White space and punctuation as delimiters, but consider…
Clairson International Corp. said it expects to report a
net loss for it’s second quarter ended March 26 and
doesn’t expect to meet analysts’ profit estimates of $3.9
to $4 million, or 76 cents a share to 79 cents a share, for
its year ending Sept. 24.
Ambiguity, cont.
Or…
A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3carbazolecarbaldehyde, 9-ethyl-3-6carbazoledicarbaldehyde, ….
“Bag of words” and sentiment
Common metric: counts of positive and negative words
I adore the hero of “I hate love stories”
Simple approach will misclassify “hate”
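A minimal bag-of-words sentiment count showing this failure mode (the positive/negative word lists are illustrative, not a standard lexicon):

```python
# Bag-of-words sentiment: count positive words minus negative words.
positive = {"adore", "love"}
negative = {"hate"}

def score(text):
    # strip quotes so title words are tokenized like everything else
    words = text.lower().replace('"', ' ').split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# "hate" and "love" inside the movie title get counted as sentiment words
s = score('I adore the hero of "I hate love stories"')
```

The count comes out +1 only because two positives outvote one negative; the model has no idea that “hate” and “love” here are part of a title, not sentiment.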
Complexity counts
• Greater statistical power from more sophisticated systems
• Ben Bernanke’s Aug. 30, 2012 Jackson Hole speech – the digital version was rapidly analyzed and there was a stock sell-off
Google searches: Sparsity and Big Data
Resampling & classical statistics
• Hypothesis tests & confidence intervals
• What-if simulation (OR applications)
• Example (confidence interval): median of 100 incomes
1. Place all values in a box
2. Randomly pick one, record, replace
3. Repeat 99 more times, record median
4. Repeat steps 2+3, say, 1000 times
5. Review distribution of medians
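The five steps above as code (toy incomes standing in for the 100 observed values; the lognormal generator is just an assumed stand-in):

```python
# Bootstrap confidence interval for a median via resampling.
import random
random.seed(42)
incomes = [random.lognormvariate(3, 0.5) for _ in range(100)]  # stand-in data

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

medians = []
for _ in range(1000):
    # steps 1-3: draw 100 values with replacement, record the median
    resample = [random.choice(incomes) for _ in incomes]
    medians.append(median(resample))

# step 5: the middle 90% of resampled medians gives a confidence interval
medians.sort()
ci_90 = (medians[50], medians[949])
```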
Resampled medians

[Histogram of the 1000 resampled medians: counts on the y-axis; median values from roughly 21.65 to 29.65 on the x-axis]
Resampling Stats for Excel
Repeat & Score dialog
A-B test (of 2 web offers)
• Control: 220 views → 7 clicks = 0.0318
• Treatment: 195 views → 11 clicks = 0.0564
• 77% improvement: 0.0564/0.0318 (= 1.77)
Resampling test:
1. Box with 18 1’s and 397 0’s (total 415)
2. Shuffle & draw 220, count 1’s
3. Count 1’s in remaining 195
4. Record ratio (the 195 group to the 220 group)
5. Repeat steps 2-4, say 1000 times
6. How often >= 1.77?
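The six steps above as code (a sketch; the exact count will vary with the random seed):

```python
# Resampling (permutation) test for the A-B comparison above.
import random
random.seed(0)

observed = (11 / 195) / (7 / 220)      # observed ratio, about 1.77
box = [1] * 18 + [0] * 397             # step 1: 18 clicks among 415 views, pooled

count, trials = 0, 1000
for _ in range(trials):
    random.shuffle(box)                # step 2: shuffle the box...
    control = box[:220]                # ...and draw 220 for the control group
    treatment = box[220:]              # step 3: the remaining 195 are treatment
    c1, t1 = sum(control), sum(treatment)
    if c1 == 0:
        continue                       # ratio undefined for this shuffle; skip
    ratio = (t1 / 195) / (c1 / 220)    # step 4: treatment rate / control rate
    if ratio >= observed:              # step 6: how often >= 1.77?
        count += 1
p_value = count / trials               # step 5 done via the loop
```

A p-value near 0.14 (as on the results slide) says a 77% improvement or better arises by chance alone in roughly one shuffle in seven, so the evidence against chance is weak.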
Results (143 of 1000 trials >= 1.77)

[Histogram of the 1000 resampled ratios: counts on the y-axis]
Material presented is partially drawn from the course text, including the XLMiner User Guide.