Upload
lybao
View
215
Download
2
Embed Size (px)
Citation preview
ffi rs.indd 01:56:13:PM 03/28/2014 Page i
Applied Predictive Analytics
Principles and Techniques for the Professional Data Analyst
Dean Abbott
ffi rs.indd 01:56:13:PM 03/28/2014 Page ii
Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2014 by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-1-118-72796-6
ISBN: 978-1-118-72793-5 (ebk)
ISBN: 978-1-118-72769-0 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means,
electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or
108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or autho-
rization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive,
Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed
to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201)
748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with
respect to the accuracy or completeness of the contents of this work and specifi cally disclaim all warranties, including
without limitation warranties of fi tness for a particular purpose. No warranty may be created or extended by sales or
promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work
is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional
services. If professional assistance is required, the services of a competent professional person should be sought.
Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or
Web site is referred to in this work as a citation and/or a potential source of further information does not mean that
the author or the publisher endorses the information the organization or website may provide or recommendations
it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disap-
peared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the
United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with
standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to
media such as a CD or DVD that is not included in the version you purchased, you may download this material at
http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2013958302
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affi liates, in the United States and other countries, and may not be used without written permission. [Insert third-
party trademark information] All other trademarks are the property of their respective owners. John Wiley & Sons,
Inc. is not associated with any product or vendor mentioned in this book.
v
ffi rs.indd 01:56:13:PM 03/28/2014 Page v
Dean Abbott is President of Abbott Analytics, Inc. in San Diego, California.
Dean is an internationally recognized data-mining and predictive analytics
expert with over two decades of experience applying advanced modeling and
data preparation techniques to a wide variety of real-world problems. He is
also Chief Data Scientist of SmarterRemarketer, a startup company focusing
on data-driven behavior segmentation and attribution.
Dean has been recognized as a top-ten data scientist and one of the top ten
most infl uential people in data analytics. His blog has been recognized as one
of the top-ten predictive analytics blogs to follow.
He is a regular speaker at Predictive Analytics World and other analytics
conferences. He is on the advisory board for the University of California Irvine
Certifi cate Program for predictive analytics and the University of California
San Diego Certifi cate Program for data mining, and is a regular instructor for
courses on predictive modeling algorithms, model deployment, and text min-
ing. He has also served several times on the program committee for the KDD
Conference Industrial Track.
About the Author
vii
ffi rs.indd 01:56:13:PM 03/28/2014 Page vii
William J. Komp has a Ph.D. from the University of Wisconsin–Milwaukee,
with a specialization in the fi elds of general relativity and cosmology. He has
been a professor of physics at the University of Louisville and Western Kentucky
University. Currently, he is a research scientist at Humana, Inc., working in the
areas of predictive analytics and data mining.
About the Technical Editor
ix
ffi rs.indd 01:56:13:PM 03/28/2014 Page ix
Executive EditorRobert Elliott
Project EditorAdaobi Obi Tulton
Technical EditorWilliam J. Komp
Senior Production EditorKathleen Wisor
Copy EditorNancy Rapoport
Manager of Content Development and AssemblyMary Beth Wakefi eld
Director of Community MarketingDavid Mayhew
Marketing ManagerAshley Zurcher
Business ManagerAmy Knies
Vice President and Executive Group PublisherRichard Swadley
Associate PublisherJim Minatel
Project Coordinator, CoverTodd Klemme
ProofreaderNancy Carrasco
IndexerJohnna VanHoose Dinse
Cover DesignerRyan Sneed
Credits
xi
ffi rs.indd 01:56:13:PM 03/28/2014 Page xi
The idea for this book began with a phone call from editor Bob Elliott, who pre-
sented the idea of writing a different kind of predictive analytics book geared
toward business professionals. My passion for more than a decade has been
to teach principles of data mining and predictive analytics to business profes-
sionals, translating the lingo of mathematics and statistics into a language the
practitioner can understand. The questions of hundreds of course and work-
shop attendees forced me to think about why we do what we do in predictive
analytics. I also thank Bob for not only persuading me that I could write the
book while continuing to be a consultant, but also for his advice on the scope
and depth of topics to cover.
I thank my father for encouraging me in analytics. I remember him teaching
me how to compute batting average and earned run average when I was eight
years old so I could compute my Little League statistics. He brought home reams
of accounting pads, which I used for playing thousands of Strat-O-Matic baseball
games, just so that I could compute everyone’s batting average and earned run
average, and see if there were signifi cant differences between what I observed
and what the players’ actual statistics were. My parents put up with a lot of
paper strewn across the fl oor for many years.
I would never have been in this fi eld were it not for Roger Barron, my fi rst
boss at Barron Associates, Inc., a pioneer in statistical learning methods and
a man who taught me to be curious, thorough, and persistent about data analysis.
His ability to envision solutions without knowing exactly how they would be
solved is something I refl ect on often.
I’ve learned much over the past 27 years from John Elder, a friend and col-
league, since our time at Barron Associates, Inc. and continuing to this day.
I am very grateful to Eric Siegel for inviting me to speak regularly at Predictive
Acknowledgments
xii Acknowledgments
ffi rs.indd 01:56:13:PM 03/28/2014 Page xii
Analytics World in sessions and workshops, and for his advice and encourage-
ment in the writing of this book.
A very special thanks goes to editors Adaobi Obi Tulton and Nancy Rapoport
for making sense of what of I was trying to communicate and making this book
more concise and clearer than I was able to do alone. Obviously, I was a math-
ematics major and not an English major. I am especially grateful for technical
editor William Komp, whose insightful comments throughout the book helped
me to sharpen points I was making.
Several software packages were used in the analyses contained in the book,
including KNIME, IBM SPSS Modeler, JMP, Statistica, Predixion, and Orange.
I thank all of these vendors for creating software that is easy to use. I also want
to thank all of the other vendors I’ve worked with over the past two decades,
who have supplied software for me to use in teaching and research.
On a personal note, this book project could not have taken place without the
support of my wife, Barbara, who encouraged me throughout the project, even
as it wore on and seemed to never end. She put up with me as I spent countless
nights and Saturdays writing, and cheered with each chapter that was submit-
ted. My two children still at home were very patient with a father who wasn’t
as available as usual for nearly a year.
As a Christian, my worldview is shaped by God and how He reveals Himself
in the Bible, through his Son, and by his Holy Spirit. When I look back at the
circumstances and people He put into my life, I only marvel at His hand
of providence, and am thankful that I’ve been able to enjoy this fi eld for my
entire career.
xiii
ftoc.indd 01:52:15:PM 03/28/2014 Page xiii
Introduction xxi
Chapter 1 Overview of Predictive Analytics 1
What Is Analytics? 3What Is Predictive Analytics? 3
Supervised vs. Unsupervised Learning 5
Parametric vs. Non-Parametric Models 6
Business Intelligence 6Predictive Analytics vs. Business Intelligence 8
Do Predictive Models Just State the Obvious? 9
Similarities between Business Intelligence
and Predictive Analytics 9
Predictive Analytics vs. Statistics 10Statistics and Analytics 11
Predictive Analytics and Statistics Contrasted 12
Predictive Analytics vs. Data Mining 13Who Uses Predictive Analytics? 13Challenges in Using Predictive Analytics 14
Obstacles in Management 14
Obstacles with Data 14
Obstacles with Modeling 15
Obstacles in Deployment 16
What Educational Background Is Needed to Become a Predictive Modeler? 16
Chapter 2 Setting Up the Problem 19
Predictive Analytics Processing Steps: CRISP-DM 19Business Understanding 21
The Three-Legged Stool 22
Business Objectives 23
Contents
xiv Contents
ftoc.indd 01:52:15:PM 03/28/2014 Page xiv
Defi ning Data for Predictive Modeling 25Defi ning the Columns as Measures 26
Defi ning the Unit of Analysis 27
Which Unit of Analysis? 28
Defi ning the Target Variable 29Temporal Considerations for Target Variable 31
Defi ning Measures of Success for Predictive Models 32Success Criteria for Classifi cation 32
Success Criteria for Estimation 33
Other Customized Success Criteria 33
Doing Predictive Modeling Out of Order 34Building Models First 34
Early Model Deployment 35
Case Study: Recovering Lapsed Donors 35Overview 36
Business Objectives 36
Data for the Competition 36
The Target Variables 36
Modeling Objectives 37
Model Selection and Evaluation Criteria 38
Model Deployment 39
Case Study: Fraud Detection 39Overview 39
Business Objectives 39
Data for the Project 40
The Target Variables 40
Modeling Objectives 41
Model Selection and Evaluation Criteria 41
Model Deployment 41
Summary 42
Chapter 3 Data Understanding 43
What the Data Looks Like 44Single Variable Summaries 44
Mean 45
Standard Deviation 45
The Normal Distribution 45
Uniform Distribution 46
Applying Simple Statistics in Data Understanding 47
Skewness 49
Kurtosis 51
Rank-Ordered Statistics 52
Categorical Variable Assessment 55
Data Visualization in One Dimension 58Histograms 59Multiple Variable Summaries 64
Contents xv
ftoc.indd 01:52:15:PM 03/28/2014 Page xv
Hidden Value in Variable Interactions: Simpson’s Paradox 64
The Combinatorial Explosion of Interactions 65
Correlations 66
Spurious Correlations 66
Back to Correlations 67
Crosstabs 68
Data Visualization, Two or Higher Dimensions 69Scatterplots 69
Anscombe’s Quartet 71
Scatterplot Matrices 75
Overlaying the Target Variable in Summary 76
Scatterplots in More Than Two Dimensions 78
The Value of Statistical Signifi cance 80Pulling It All Together into a Data Audit 81Summary 82
Chapter 4 Data Preparation 83
Variable Cleaning 84Incorrect Values 84
Consistency in Data Formats 85
Outliers 85
Multidimensional Outliers 89
Missing Values 90
Fixing Missing Data 91
Feature Creation 98Simple Variable Transformations 98
Fixing Skew 99
Binning Continuous Variables 103
Numeric Variable Scaling 104
Nominal Variable Transformation 107
Ordinal Variable Transformations 108
Date and Time Variable Features 109
ZIP Code Features 110
Which Version of a Variable Is Best? 110
Multidimensional Features 112
Variable Selection Prior to Modeling 117
Sampling 123
Example: Why Normalization Matters for K-Means Clustering 139
Summary 143
Chapter 5 Itemsets and Association Rules 145
Terminology 146Condition 147
Left-Hand-Side, Antecedent(s) 148
Right-Hand-Side, Consequent, Output, Conclusion 148
Rule (Item Set) 148
xvi Contents
ftoc.indd 01:52:15:PM 03/28/2014 Page xvi
Support 149
Antecedent Support 149
Confi dence, Accuracy 150
Lift 150
Parameter Settings 151How the Data Is Organized 151
Standard Predictive Modeling Data Format 151
Transactional Format 152
Measures of Interesting Rules 154Deploying Association Rules 156
Variable Selection 157
Interaction Variable Creation 157
Problems with Association Rules 158Redundant Rules 158
Too Many Rules 158
Too Few Rules 159
Building Classifi cation Rules from Association Rules 159Summary 161
Chapter 6 Descriptive Modeling 163
Data Preparation Issues with Descriptive Modeling 164Principal Component Analysis 165
The PCA Algorithm 165
Applying PCA to New Data 169
PCA for Data Interpretation 171
Additional Considerations before Using PCA 172
The Effect of Variable Magnitude on PCA Models 174
Clustering Algorithms 177The K-Means Algorithm 178
Data Preparation for K-Means 183
Selecting the Number of Clusters 185
The Kohonen SOM Algorithm 192
Visualizing Kohonen Maps 194
Similarities with K-Means 196
Summary 197
Chapter 7 Interpreting Descriptive Models 199
Standard Cluster Model Interpretation 199Problems with Interpretation Methods 202
Identifying Key Variables in Forming Cluster Models 203
Cluster Prototypes 209
Cluster Outliers 210
Summary 212
Chapter 8 Predictive Modeling 213
Decision Trees 214The Decision Tree Landscape 215
Building Decision Trees 218
Contents xvii
ftoc.indd 01:52:15:PM 03/28/2014 Page xvii
Decision Tree Splitting Metrics 221
Decision Tree Knobs and Options 222
Reweighting Records: Priors 224
Reweighting Records: Misclassifi cation Costs 224
Other Practical Considerations for Decision Trees 229
Logistic Regression 230Interpreting Logistic Regression Models 233
Other Practical Considerations for Logistic Regression 235
Neural Networks 240Building Blocks: The Neuron 242
Neural Network Training 244
The Flexibility of Neural Networks 247
Neural Network Settings 249
Neural Network Pruning 251
Interpreting Neural Networks 252
Neural Network Decision Boundaries 253
Other Practical Considerations for Neural Networks 253
K-Nearest Neighbor 254The k-NN Learning Algorithm 254
Distance Metrics for k-NN 258
Other Practical Considerations for k-NN 259
Naïve Bayes 264Bayes’ Theorem 264
The Naïve Bayes Classifi er 268
Interpreting Naïve Bayes Classifi ers 268
Other Practical Considerations for Naïve Bayes 269
Regression Models 270Linear Regression 271
Linear Regression Assumptions 274
Variable Selection in Linear Regression 276
Interpreting Linear Regression Models 278
Using Linear Regression for Classifi cation 279
Other Regression Algorithms 280Summary 281
Chapter 9 Assessing Predictive Models 283
Batch Approach to Model Assessment 284Percent Correct Classifi cation 284
Rank-Ordered Approach to Model Assessment 293
Assessing Regression Models 301Summary 304
Chapter 10 Model Ensembles 307
Motivation for Ensembles 307The Wisdom of Crowds 308
Bias Variance Tradeoff 309
Bagging 311
xviii Contents
ftoc.indd 01:52:15:PM 03/28/2014 Page xviii
Boosting 316Improvements to Bagging and Boosting 320
Random Forests 320
Stochastic Gradient Boosting 321
Heterogeneous Ensembles 321
Model Ensembles and Occam’s Razor 323Interpreting Model Ensembles 323Summary 326
Chapter 11 Text Mining 327
Motivation for Text Mining 328A Predictive Modeling Approach to Text Mining 329Structured vs. Unstructured Data 329Why Text Mining Is Hard 330
Text Mining Applications 332
Data Sources for Text Mining 333
Data Preparation Steps 333POS Tagging 333
Tokens 336
Stop Word and Punctuation Filters 336
Character Length and Number Filters 337
Stemming 337
Dictionaries 338
The Sentiment Polarity Movie Data Set 339
Text Mining Features 340Term Frequency 341
Inverse Document Frequency 344
TF-IDF 344
Cosine Similarity 346
Multi-Word Features: N-Grams 346
Reducing Keyword Features 347
Grouping Terms 347
Modeling with Text Mining Features 347Regular Expressions 349
Uses of Regular Expressions in Text Mining 351
Summary 352
Chapter 12 Model Deployment 353
General Deployment Considerations 354Deployment Steps 355
Summary 375
Chapter 13 Case Studies 377
Survey Analysis Case Study: Overview 377Business Understanding: Defi ning the Problem 378
Data Understanding 380
Data Preparation 381
Modeling 385
Contents xix
ftoc.indd 01:52:15:PM 03/28/2014 Page xix
Deployment: “What-If” Analysis 391
Revisit Models 392
Deployment 401
Summary and Conclusions 401
Help Desk Case Study 402Data Understanding: Defi ning the Data 403
Data Preparation 403
Modeling 405
Revisit Business Understanding 407
Deployment 409
Summary and Conclusions 411
Index 413
xxi
fl ast.indd 01:56:26:PM 03/28/2014 Page xxi
The methods behind predictive analytics have a rich history, combining disci-
plines of mathematics, statistics, social science, and computer science. The label
“predictive analytics” is relatively new, however, coming to the forefront only in
the past decade. It stands on the shoulders of other analytics-centric fi elds such
as data mining, machine learning, statistics, and pattern recognition.
This book describes the predictive modeling process from the perspective
of a practitioner rather than a theoretician. Predictive analytics is both science
and art. The science is relatively easy to describe but to do the subject justice
requires considerable knowledge of mathematics. I don’t believe a good practi-
tioner needs to understand the mathematics of the algorithms to be able to apply
them successfully, any more than a graphic designer needs to understand the
mathematics of image correction algorithms to apply sharpening fi lters well.
However, the better practitioners understand the effects of changing modeling
parameters, the implications of the assumptions of algorithms in predictions,
and the limitations of the algorithms, the more successful they will be, especially
in the most challenging modeling projects.
Science is covered in this book, but not in the same depth you will fi nd in
academic treatments of the subject. Even though you won’t fi nd sections describ-
ing how to decompose matrices while building linear regression models, this
book does not treat algorithms as black boxes.
The book describes what I would be telling you if you were looking over my
shoulder while I was solving a problem, especially when surprises occur in
the data. How can algorithms fool us into thinking their performance is excel-
lent when they are actually brittle? Why did I bin a variable rather than just
transform it numerically? Why did I use logistic regression instead of a neural
network, or vice versa? Why did I build a linear regression model to predict a
Introduction
xxii Introduction
fl ast.indd 01:56:26:PM 03/28/2014 Page xxii
binary outcome? These are the kinds of questions, the art of predictive model-
ing, that this book addresses directly and indirectly.
I don’t claim that the approaches described in this book represent the only way
to solve problems. There are many contingencies you may encounter where the
approaches described in this book are not appropriate. I can hear the comments
already from other experienced statisticians or data miners asking, “but what
about . . . ?” or “have you tried doing this . . . ?” I’ve found over the years that
there are often many ways to solve problems successfully, and the approach you
take depends on many factors, including your personal style in solving prob-
lems and the desires of the client for how to solve the problems. However, even
more often than this, I’ve found that the biggest gains in successful modeling
come more from understanding data than from applying a more sophisticated
algorithm.
How This Book Is Organized
The book is organized around the Cross-Industry Standard Process Model for
Data Mining (CRISP-DM). While the name uses the term “data mining,” the
steps are the same in predictive analytics or any project that includes the build-
ing of predictive or statistical models.
This isn’t the only framework you can use, but it a well-documented frame-
work that has been around for more than 15 years. CRISP-DM should not be
used as a recipe with checkboxes, but rather as a guide during your predictive
modeling projects.
The six phases or steps in CRISP-DM:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
After an introductory chapter, Business Understanding, Data Understanding,
and Data Preparation receive a chapter each in Chapters 2, 3, and 4. The Data
Understanding and Data Preparation chapters represent more than one-quarter
of the pages of technical content of the book, with Data Preparation the largest
single chapter. This is appropriate because preparing data for predictive model-
ing is nearly always considered the most time-consuming stage in a predictive
modeling project.
Introduction xxiii
fl ast.indd 01:56:26:PM 03/28/2014 Page xxiii
Chapter 5 describes association rules, usually considered a modeling method,
but one that differs enough from other descriptive and predictive modeling
techniques that it warrants a separate chapter.
The modeling methods—descriptive, predictive, and model ensembles—are
described in Chapters 6, 8, and 10. Sandwiched in between these are the model-
ing evaluation or assessment methods for descriptive and predictive modeling,
in Chapters 7 and 9, respectively. This sequence is intended to connect more
directly the modeling methods with how they are evaluated.
Chapter 11 takes a side-turn into text mining, a method of analysis of unstruc-
tured data that is becoming more mainstream in predictive modeling. Software
packages are increasingly including text mining algorithms built into the soft-
ware or as an add-on package. Chapter 12 describes the sixth phase of predictive
modeling: Model Deployment.
Finally, Chapter 13 contains two case studies: one based on work I did as a
contractor to Seer Analytics for the YMCA, and the second for a Fortune 500
company. The YMCA case study is actually two case studies built into a single
narrative because we took vastly different approaches to solve the problem.
The case studies are written from the perspective of an analyst as he considers
different ways to solve the problem, including solutions that were good ideas
but didn’t work well.
Throughout the book, I visit many of the problems in predictive modeling
that I have encountered in projects over the past 27 years, but the list certainly is
not exhaustive. Even with the inherent limitations of book format, the thought
process and principles are more important than developing a complete list of
possible approaches to solving problems.
Who Should Read This Book
This book is intended to be read by anyone currently in the fi eld of predictive
analytics or its related fi elds, including data mining, statistics, machine learn-
ing, data science, and business analytics.
For those who are just starting in the fi eld, the book will provide a description
of the core principles every modeler needs to know.
The book will also help those who wish to enter these fi elds but aren’t yet there.
It does not require a mathematics background beyond pre-calculus to under-
stand predictive analytics methods, though knowledge of calculus and linear
algebra will certainly help. The same applies to statistics. One can understand
the material in this book without a background in statistics; I, for one, have never
taken a course in statistics. However, understanding statistics certainly provides
additional intuition and insight into the approaches described in this book.
xxiv Introduction
fl ast.indd 01:56:26:PM 03/28/2014 Page xxiv
Tools You Will Need
No predictive modeling software is needed to understand the concepts in the
book, and no examples are shown that require a particular predictive analytics
software package. This was deliberate because the principles and techniques
described in this book are general, and apply to any software package. When
applicable, I describe which principles are usually available in software and
which are rarely available in software.
What’s on the Website
All data sets used in examples throughout the textbook are included on the
Wiley website at www.wiley.com/go/appliedpredictiveanalytics and in the
Resources section of www.abbottanalytics.com. These data sets include:
■ KDD Cup 1998 data, simplifi ed from the full version
■ Nasadata data set
■ Iris data set
In addition, baseline scripts, workfl ows, streams, or other documents specifi c
to predictive analytics software will be included as they become available.
These will provide baseline processing steps to replicate analyses from the
book chapters.
Summary
My hope is that this book will encourage those who want to be more effective
predictive modelers to continue to work at their craft. I’ve worked side-by-side
with dozens of predictive modelers over the years.
I always love watching how effective predictive modelers go about solving
problems, perhaps in ways I had never considered before. For those with more
experience, my hope is that this book will describe data preparation, modeling,
evaluation, and deployment in a way that even experienced modelers haven’t
thought of before.
1
c01.indd 01:52:36:PM 03/28/2014 Page 1
A small direct response company had developed dozens of programs in coop-
eration with major brands to sell books and DVDs. These affi nity programs
were very successful, but required considerable up-front work to develop the
creative content and determine which customers, already engaged with the brand,
were worth the signifi cant marketing spend to purchase the books or DVDs
on subscription. Typically, they fi rst developed test mailings on a moderately
sized sample to determine if the expected response rates were high enough to
justify a larger program.
One analyst with the company identifi ed a way to help the company become
more profi table. What if one could identify the key characteristics of those who
responded to the test mailing? Furthermore, what if one could generate a score
for these customers and determine what minimum score would result in a high
enough response rate to make the campaign profi table? The analyst discovered
predictive analytics techniques that could be used for both purposes, fi nding
key customer characteristics and using those characteristics to generate a score
that could be used to determine which customers to mail.
Two decades before, the owner of a small company in Virginia had a com-
pelling idea: Improve the accuracy and fl exibility of guided munitions using
optimal control. The owner and president, Roger Barron, began the process of
deriving the complex mathematics behind optimal control using a technique
known as variational calculus and hired a graduate student to assist him in the
task. Programmers then implemented the mathematics in computer code so
C H A P T E R
1
Overview of Predictive Analytics
2 Chapter 1 ■ Overview of Predictive Analytics
c01.indd 01:52:36:PM 03/28/2014 Page 2
they could simulate thousands of scenarios. For each trajectory, the variational
calculus minimized the miss distance while maximizing speed at impact as
well as the angle of impact.
The variational calculus algorithm succeeded in identifying the optimal
sequence of commands: how much the fi ns (control surfaces) needed to change
the path of the munition to follow the optimal path to the target. The concept
worked in simulation in the thousands of optimal trajectories that were run.
Moreover, the mathematics worked on several munitions, one of which was
the MK82 glide bomb, fi tted (in simulation) with an inertial guidance unit to
control the fi ns: an early smart-bomb.
There was a problem, however. The variational calculus was so computation-
ally complex that the small computers on-board could not solve the problem
in real time. But what if one could estimate the optimal guidance commands at
any time during the fl ight from observable characteristics of the fl ight? After
all, the guidance unit can compute where the bomb is in space, how fast it is
going, and the distance of the target that was programmed into the unit when it
was launched. If the estimates of the optimum guidance commands were close
enough to the actual optimal path, it would be near optimal and still succeed.
Predictive models were built to do exactly this. The system was called Optimal Path-to-Go guidance.
These two programs designed by two different companies seemingly could
not be more different. One program knows characteristics of people, such as
demographics and their level of engagement with a brand, and tries to predict
a human decision. The second program knows locations of a bomb in space and
tries to predict the best physical action for it to hit a target.
But they share something in common: They both need to estimate values that
are unknown but tremendously useful. For the affi nity programs, the models
estimate whether or not an individual will respond to a campaign, and for the
guidance program, the models estimate the best guidance command. In this
sense, these two programs are very similar because they both involve predict-
ing a value or values that are known historically, but are unknown at the time
a decision is needed. Not only are these programs related in this sense, but they
are far from unique; there are countless decisions businesses and government
agencies make every day that can be improved by using historic data as an aid
to making decisions or even to automate the decisions themselves.
This book describes the back-story behind how analysts build the predictive
models like the ones described in these two programs. There is science behind
much of what predictive modelers do, yet there is also plenty of art, where no
theory can inform us as to the best action, but experience provides principles
by which tradeoffs can be made as solutions are found. Without the art, the sci-
ence would only be able to solve a small subset of problems we face. Without
Chapter 1 ■ Overview of Predictive Analytics 3
c01.indd 01:52:36:PM 03/28/2014 Page 3
the science, we would be like a plane without a rudder or a kite without a tail,
moving at a rapid pace without any control, unable to achieve our objectives.
What Is Analytics?
Analytics is the process of using computational methods to discover and report
infl uential patterns in data. The goal of analytics is to gain insight and often
to affect decisions. Data is necessarily a measure of historic information so, by
defi nition, analytics examines historic data. The term itself rose to prominence
in 2005, in large part due to the introduction of Google Analytics. Nevertheless,
the ideas behind analytics are not new at all but have been represented by dif-
ferent terms throughout the decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge discovery, data mining, and now
even data science.The rise of analytics in recent years is pragmatic: As organizations collect more
data and begin to summarize it, there is a natural progression toward using
the data to improve estimates, forecasts, decisions, and ultimately, effi ciency.
What Is Predictive Analytics?
Predictive analytics is the process of discovering interesting and meaningful
patterns in data. It draws from several related disciplines, some of which have
been used to discover patterns in data for more than 100 years, including pat-
tern recognition, statistics, machine learning, artifi cial intelligence, and data
mining. What differentiates predictive analytics from other types of analytics?
First, predictive analytics is data-driven, meaning that algorithms derive key
characteristic of the models from the data itself rather than from assumptions
made by the analyst. Put another way, data-driven algorithms induce models
from the data. The induction process can include identifi cation of variables to
be included in the model, parameters that defi ne the model, weights or coef-
fi cients in the model, or model complexity.
Second, predictive analytics algorithms automate the process of fi nding the
patterns from the data. Powerful induction algorithms not only discover coef-
fi cients or weights for the models, but also the very form of the models. Decision
trees algorithms, for example, learn which of the candidate inputs best predict
a target variable in addition to identifying which values of the variables to use
in building predictions. Other algorithms can be modifi ed to perform searches,
using exhaustive or greedy searches to fi nd the best set of inputs and model
parameters. If the variable helps reduce model error, the variable is included
4 Chapter 1 ■ Overview of Predictive Analytics
c01.indd 01:52:36:PM 03/28/2014 Page 4
in the model. Otherwise, if the variable does not help to reduce model error, it
is eliminated.
Another automation task available in many software packages and algorithms
automates the process of transforming input variables so that they can be used
effectively in the predictive models. For example, if there are a hundred variables
that are candidate inputs to models that can be or should be transformed to
remove skew, you can do this with some predictive analytics software in a single
step rather than programming all one hundred transformations one at a time.
Predictive analytics doesn’t do anything that any analyst couldn’t accomplish
with pencil and paper or a spreadsheet if given enough time; the algorithms,
while powerful, have no common sense. Consider a supervised learning
data set with 50 inputs and a single binary target variable with values 0 and
1. One way to try to identify which of the inputs is most related to the target
variable is to plot each variable, one at a time, in a histogram. The target vari-
able can be superimposed on the histogram, as shown in Figure 1-1. With 50
inputs, you need to look at 50 histograms. This is not uncommon for predic-
tive modelers to do.
If the patterns require examining two variables at a time, you can do so with
a scatter plot. For 50 variables, there are 1,225 possible scatter plots to examine.
A dedicated predictive modeler might actually do this, although it will take
some time. However, if the patterns require that you examine three variables
simultaneously, you would need to examine 19,600 3D scatter plots in order to
examine all the possible three-way combinations. Even the most dedicated mod-
elers will be hard-pressed to spend the time needed to examine so many plots.
0[−0.926– −0.658] [−0.122–0.148] [0.688–0.951] [1.498–1.768]
102
204
306
408
510
612
714
816
918
1020
1129
Figure 1-1: Histogram
You need algorithms to sift through all of the potential combinations of inputs
in the data—the patterns—and identify which ones are the most interesting. The
analyst can then focus on these patterns, undoubtedly a much smaller number
of inputs to examine. Of the 19,600 three-way combinations of inputs, it may