30

ffi rs.indd 01:56:13:PM 03/28/2014 Page ivdownload.e-bookshelf.de/download/0002/3415/42/L-G-0002341542... · to teach principles of data mining and predictive analytics to business

  • Upload
    lybao

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

ffi rs.indd 01:56:13:PM 03/28/2014 Page iv

ffi rs.indd 01:56:13:PM 03/28/2014 Page i

Applied Predictive Analytics

Principles and Techniques for the Professional Data Analyst

Dean Abbott

ffi rs.indd 01:56:13:PM 03/28/2014 Page ii

Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst

Published by

John Wiley & Sons, Inc.

10475 Crosspoint Boulevard

Indianapolis, IN 46256

www.wiley.com

Copyright © 2014 by John Wiley & Sons, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-1-118-72796-6

ISBN: 978-1-118-72793-5 (ebk)

ISBN: 978-1-118-72769-0 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means,

electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or

108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or autho-

rization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive,

Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed

to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201)

748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with

respect to the accuracy or completeness of the contents of this work and specifi cally disclaim all warranties, including

without limitation warranties of fi tness for a particular purpose. No warranty may be created or extended by sales or

promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work

is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional

services. If professional assistance is required, the services of a competent professional person should be sought.

Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or

Web site is referred to in this work as a citation and/or a potential source of further information does not mean that

the author or the publisher endorses the information the organization or website may provide or recommendations

it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disap-

peared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the

United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with

standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to

media such as a CD or DVD that is not included in the version you purchased, you may download this material at

http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2013958302

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or

its affi liates, in the United States and other countries, and may not be used without written permission. [Insert third-

party trademark information] All other trademarks are the property of their respective owners. John Wiley & Sons,

Inc. is not associated with any product or vendor mentioned in this book.

ffi rs.indd 01:56:13:PM 03/28/2014 Page iii

To Barbara

ffi rs.indd 01:56:13:PM 03/28/2014 Page iv

v

ffi rs.indd 01:56:13:PM 03/28/2014 Page v

Dean Abbott is President of Abbott Analytics, Inc. in San Diego, California.

Dean is an internationally recognized data-mining and predictive analytics

expert with over two decades of experience applying advanced modeling and

data preparation techniques to a wide variety of real-world problems. He is

also Chief Data Scientist of SmarterRemarketer, a startup company focusing

on data-driven behavior segmentation and attribution.

Dean has been recognized as a top-ten data scientist and one of the top ten

most infl uential people in data analytics. His blog has been recognized as one

of the top-ten predictive analytics blogs to follow.

He is a regular speaker at Predictive Analytics World and other analytics

conferences. He is on the advisory board for the University of California Irvine

Certifi cate Program for predictive analytics and the University of California

San Diego Certifi cate Program for data mining, and is a regular instructor for

courses on predictive modeling algorithms, model deployment, and text min-

ing. He has also served several times on the program committee for the KDD

Conference Industrial Track.

About the Author

ffi rs.indd 01:56:13:PM 03/28/2014 Page vi

vii

ffi rs.indd 01:56:13:PM 03/28/2014 Page vii

William J. Komp has a Ph.D. from the University of Wisconsin–Milwaukee,

with a specialization in the fi elds of general relativity and cosmology. He has

been a professor of physics at the University of Louisville and Western Kentucky

University. Currently, he is a research scientist at Humana, Inc., working in the

areas of predictive analytics and data mining.

About the Technical Editor

ffi rs.indd 01:56:13:PM 03/28/2014 Page viii

ix

ffi rs.indd 01:56:13:PM 03/28/2014 Page ix

Executive EditorRobert Elliott

Project EditorAdaobi Obi Tulton

Technical EditorWilliam J. Komp

Senior Production EditorKathleen Wisor

Copy EditorNancy Rapoport

Manager of Content Development and AssemblyMary Beth Wakefi eld

Director of Community MarketingDavid Mayhew

Marketing ManagerAshley Zurcher

Business ManagerAmy Knies

Vice President and Executive Group PublisherRichard Swadley

Associate PublisherJim Minatel

Project Coordinator, CoverTodd Klemme

ProofreaderNancy Carrasco

IndexerJohnna VanHoose Dinse

Cover DesignerRyan Sneed

Credits

ffi rs.indd 01:56:13:PM 03/28/2014 Page x

xi

ffi rs.indd 01:56:13:PM 03/28/2014 Page xi

The idea for this book began with a phone call from editor Bob Elliott, who pre-

sented the idea of writing a different kind of predictive analytics book geared

toward business professionals. My passion for more than a decade has been

to teach principles of data mining and predictive analytics to business profes-

sionals, translating the lingo of mathematics and statistics into a language the

practitioner can understand. The questions of hundreds of course and work-

shop attendees forced me to think about why we do what we do in predictive

analytics. I also thank Bob for not only persuading me that I could write the

book while continuing to be a consultant, but also for his advice on the scope

and depth of topics to cover.

I thank my father for encouraging me in analytics. I remember him teaching

me how to compute batting average and earned run average when I was eight

years old so I could compute my Little League statistics. He brought home reams

of accounting pads, which I used for playing thousands of Strat-O-Matic baseball

games, just so that I could compute everyone’s batting average and earned run

average, and see if there were signifi cant differences between what I observed

and what the players’ actual statistics were. My parents put up with a lot of

paper strewn across the fl oor for many years.

I would never have been in this fi eld were it not for Roger Barron, my fi rst

boss at Barron Associates, Inc., a pioneer in statistical learning methods and

a man who taught me to be curious, thorough, and persistent about data analysis.

His ability to envision solutions without knowing exactly how they would be

solved is something I refl ect on often.

I’ve learned much over the past 27 years from John Elder, a friend and col-

league, since our time at Barron Associates, Inc. and continuing to this day.

I am very grateful to Eric Siegel for inviting me to speak regularly at Predictive

Acknowledgments

xii Acknowledgments

ffi rs.indd 01:56:13:PM 03/28/2014 Page xii

Analytics World in sessions and workshops, and for his advice and encourage-

ment in the writing of this book.

A very special thanks goes to editors Adaobi Obi Tulton and Nancy Rapoport

for making sense of what of I was trying to communicate and making this book

more concise and clearer than I was able to do alone. Obviously, I was a math-

ematics major and not an English major. I am especially grateful for technical

editor William Komp, whose insightful comments throughout the book helped

me to sharpen points I was making.

Several software packages were used in the analyses contained in the book,

including KNIME, IBM SPSS Modeler, JMP, Statistica, Predixion, and Orange.

I thank all of these vendors for creating software that is easy to use. I also want

to thank all of the other vendors I’ve worked with over the past two decades,

who have supplied software for me to use in teaching and research.

On a personal note, this book project could not have taken place without the

support of my wife, Barbara, who encouraged me throughout the project, even

as it wore on and seemed to never end. She put up with me as I spent countless

nights and Saturdays writing, and cheered with each chapter that was submit-

ted. My two children still at home were very patient with a father who wasn’t

as available as usual for nearly a year.

As a Christian, my worldview is shaped by God and how He reveals Himself

in the Bible, through his Son, and by his Holy Spirit. When I look back at the

circumstances and people He put into my life, I only marvel at His hand

of providence, and am thankful that I’ve been able to enjoy this fi eld for my

entire career.

xiii

ftoc.indd 01:52:15:PM 03/28/2014 Page xiii

Introduction xxi

Chapter 1 Overview of Predictive Analytics 1

What Is Analytics? 3What Is Predictive Analytics? 3

Supervised vs. Unsupervised Learning 5

Parametric vs. Non-Parametric Models 6

Business Intelligence 6Predictive Analytics vs. Business Intelligence 8

Do Predictive Models Just State the Obvious? 9

Similarities between Business Intelligence

and Predictive Analytics 9

Predictive Analytics vs. Statistics 10Statistics and Analytics 11

Predictive Analytics and Statistics Contrasted 12

Predictive Analytics vs. Data Mining 13Who Uses Predictive Analytics? 13Challenges in Using Predictive Analytics 14

Obstacles in Management 14

Obstacles with Data 14

Obstacles with Modeling 15

Obstacles in Deployment 16

What Educational Background Is Needed to Become a Predictive Modeler? 16

Chapter 2 Setting Up the Problem 19

Predictive Analytics Processing Steps: CRISP-DM 19Business Understanding 21

The Three-Legged Stool 22

Business Objectives 23

Contents

xiv Contents

ftoc.indd 01:52:15:PM 03/28/2014 Page xiv

Defi ning Data for Predictive Modeling 25Defi ning the Columns as Measures 26

Defi ning the Unit of Analysis 27

Which Unit of Analysis? 28

Defi ning the Target Variable 29Temporal Considerations for Target Variable 31

Defi ning Measures of Success for Predictive Models 32Success Criteria for Classifi cation 32

Success Criteria for Estimation 33

Other Customized Success Criteria 33

Doing Predictive Modeling Out of Order 34Building Models First 34

Early Model Deployment 35

Case Study: Recovering Lapsed Donors 35Overview 36

Business Objectives 36

Data for the Competition 36

The Target Variables 36

Modeling Objectives 37

Model Selection and Evaluation Criteria 38

Model Deployment 39

Case Study: Fraud Detection 39Overview 39

Business Objectives 39

Data for the Project 40

The Target Variables 40

Modeling Objectives 41

Model Selection and Evaluation Criteria 41

Model Deployment 41

Summary 42

Chapter 3 Data Understanding 43

What the Data Looks Like 44Single Variable Summaries 44

Mean 45

Standard Deviation 45

The Normal Distribution 45

Uniform Distribution 46

Applying Simple Statistics in Data Understanding 47

Skewness 49

Kurtosis 51

Rank-Ordered Statistics 52

Categorical Variable Assessment 55

Data Visualization in One Dimension 58Histograms 59Multiple Variable Summaries 64

Contents xv

ftoc.indd 01:52:15:PM 03/28/2014 Page xv

Hidden Value in Variable Interactions: Simpson’s Paradox 64

The Combinatorial Explosion of Interactions 65

Correlations 66

Spurious Correlations 66

Back to Correlations 67

Crosstabs 68

Data Visualization, Two or Higher Dimensions 69Scatterplots 69

Anscombe’s Quartet 71

Scatterplot Matrices 75

Overlaying the Target Variable in Summary 76

Scatterplots in More Than Two Dimensions 78

The Value of Statistical Signifi cance 80Pulling It All Together into a Data Audit 81Summary 82

Chapter 4 Data Preparation 83

Variable Cleaning 84Incorrect Values 84

Consistency in Data Formats 85

Outliers 85

Multidimensional Outliers 89

Missing Values 90

Fixing Missing Data 91

Feature Creation 98Simple Variable Transformations 98

Fixing Skew 99

Binning Continuous Variables 103

Numeric Variable Scaling 104

Nominal Variable Transformation 107

Ordinal Variable Transformations 108

Date and Time Variable Features 109

ZIP Code Features 110

Which Version of a Variable Is Best? 110

Multidimensional Features 112

Variable Selection Prior to Modeling 117

Sampling 123

Example: Why Normalization Matters for K-Means Clustering 139

Summary 143

Chapter 5 Itemsets and Association Rules 145

Terminology 146Condition 147

Left-Hand-Side, Antecedent(s) 148

Right-Hand-Side, Consequent, Output, Conclusion 148

Rule (Item Set) 148

xvi Contents

ftoc.indd 01:52:15:PM 03/28/2014 Page xvi

Support 149

Antecedent Support 149

Confi dence, Accuracy 150

Lift 150

Parameter Settings 151How the Data Is Organized 151

Standard Predictive Modeling Data Format 151

Transactional Format 152

Measures of Interesting Rules 154Deploying Association Rules 156

Variable Selection 157

Interaction Variable Creation 157

Problems with Association Rules 158Redundant Rules 158

Too Many Rules 158

Too Few Rules 159

Building Classifi cation Rules from Association Rules 159Summary 161

Chapter 6 Descriptive Modeling 163

Data Preparation Issues with Descriptive Modeling 164Principal Component Analysis 165

The PCA Algorithm 165

Applying PCA to New Data 169

PCA for Data Interpretation 171

Additional Considerations before Using PCA 172

The Effect of Variable Magnitude on PCA Models 174

Clustering Algorithms 177The K-Means Algorithm 178

Data Preparation for K-Means 183

Selecting the Number of Clusters 185

The Kohonen SOM Algorithm 192

Visualizing Kohonen Maps 194

Similarities with K-Means 196

Summary 197

Chapter 7 Interpreting Descriptive Models 199

Standard Cluster Model Interpretation 199Problems with Interpretation Methods 202

Identifying Key Variables in Forming Cluster Models 203

Cluster Prototypes 209

Cluster Outliers 210

Summary 212

Chapter 8 Predictive Modeling 213

Decision Trees 214The Decision Tree Landscape 215

Building Decision Trees 218

Contents xvii

ftoc.indd 01:52:15:PM 03/28/2014 Page xvii

Decision Tree Splitting Metrics 221

Decision Tree Knobs and Options 222

Reweighting Records: Priors 224

Reweighting Records: Misclassifi cation Costs 224

Other Practical Considerations for Decision Trees 229

Logistic Regression 230Interpreting Logistic Regression Models 233

Other Practical Considerations for Logistic Regression 235

Neural Networks 240Building Blocks: The Neuron 242

Neural Network Training 244

The Flexibility of Neural Networks 247

Neural Network Settings 249

Neural Network Pruning 251

Interpreting Neural Networks 252

Neural Network Decision Boundaries 253

Other Practical Considerations for Neural Networks 253

K-Nearest Neighbor 254The k-NN Learning Algorithm 254

Distance Metrics for k-NN 258

Other Practical Considerations for k-NN 259

Naïve Bayes 264Bayes’ Theorem 264

The Naïve Bayes Classifi er 268

Interpreting Naïve Bayes Classifi ers 268

Other Practical Considerations for Naïve Bayes 269

Regression Models 270Linear Regression 271

Linear Regression Assumptions 274

Variable Selection in Linear Regression 276

Interpreting Linear Regression Models 278

Using Linear Regression for Classifi cation 279

Other Regression Algorithms 280Summary 281

Chapter 9 Assessing Predictive Models 283

Batch Approach to Model Assessment 284Percent Correct Classifi cation 284

Rank-Ordered Approach to Model Assessment 293

Assessing Regression Models 301Summary 304

Chapter 10 Model Ensembles 307

Motivation for Ensembles 307The Wisdom of Crowds 308

Bias Variance Tradeoff 309

Bagging 311

xviii Contents

ftoc.indd 01:52:15:PM 03/28/2014 Page xviii

Boosting 316Improvements to Bagging and Boosting 320

Random Forests 320

Stochastic Gradient Boosting 321

Heterogeneous Ensembles 321

Model Ensembles and Occam’s Razor 323Interpreting Model Ensembles 323Summary 326

Chapter 11 Text Mining 327

Motivation for Text Mining 328A Predictive Modeling Approach to Text Mining 329Structured vs. Unstructured Data 329Why Text Mining Is Hard 330

Text Mining Applications 332

Data Sources for Text Mining 333

Data Preparation Steps 333POS Tagging 333

Tokens 336

Stop Word and Punctuation Filters 336

Character Length and Number Filters 337

Stemming 337

Dictionaries 338

The Sentiment Polarity Movie Data Set 339

Text Mining Features 340Term Frequency 341

Inverse Document Frequency 344

TF-IDF 344

Cosine Similarity 346

Multi-Word Features: N-Grams 346

Reducing Keyword Features 347

Grouping Terms 347

Modeling with Text Mining Features 347Regular Expressions 349

Uses of Regular Expressions in Text Mining 351

Summary 352

Chapter 12 Model Deployment 353

General Deployment Considerations 354Deployment Steps 355

Summary 375

Chapter 13 Case Studies 377

Survey Analysis Case Study: Overview 377Business Understanding: Defi ning the Problem 378

Data Understanding 380

Data Preparation 381

Modeling 385

Contents xix

ftoc.indd 01:52:15:PM 03/28/2014 Page xix

Deployment: “What-If” Analysis 391

Revisit Models 392

Deployment 401

Summary and Conclusions 401

Help Desk Case Study 402Data Understanding: Defi ning the Data 403

Data Preparation 403

Modeling 405

Revisit Business Understanding 407

Deployment 409

Summary and Conclusions 411

Index 413

xxi

fl ast.indd 01:56:26:PM 03/28/2014 Page xxi

The methods behind predictive analytics have a rich history, combining disci-

plines of mathematics, statistics, social science, and computer science. The label

“predictive analytics” is relatively new, however, coming to the forefront only in

the past decade. It stands on the shoulders of other analytics-centric fi elds such

as data mining, machine learning, statistics, and pattern recognition.

This book describes the predictive modeling process from the perspective

of a practitioner rather than a theoretician. Predictive analytics is both science

and art. The science is relatively easy to describe but to do the subject justice

requires considerable knowledge of mathematics. I don’t believe a good practi-

tioner needs to understand the mathematics of the algorithms to be able to apply

them successfully, any more than a graphic designer needs to understand the

mathematics of image correction algorithms to apply sharpening fi lters well.

However, the better practitioners understand the effects of changing modeling

parameters, the implications of the assumptions of algorithms in predictions,

and the limitations of the algorithms, the more successful they will be, especially

in the most challenging modeling projects.

Science is covered in this book, but not in the same depth you will fi nd in

academic treatments of the subject. Even though you won’t fi nd sections describ-

ing how to decompose matrices while building linear regression models, this

book does not treat algorithms as black boxes.

The book describes what I would be telling you if you were looking over my

shoulder while I was solving a problem, especially when surprises occur in

the data. How can algorithms fool us into thinking their performance is excel-

lent when they are actually brittle? Why did I bin a variable rather than just

transform it numerically? Why did I use logistic regression instead of a neural

network, or vice versa? Why did I build a linear regression model to predict a

Introduction

xxii Introduction

fl ast.indd 01:56:26:PM 03/28/2014 Page xxii

binary outcome? These are the kinds of questions, the art of predictive model-

ing, that this book addresses directly and indirectly.

I don’t claim that the approaches described in this book represent the only way

to solve problems. There are many contingencies you may encounter where the

approaches described in this book are not appropriate. I can hear the comments

already from other experienced statisticians or data miners asking, “but what

about . . . ?” or “have you tried doing this . . . ?” I’ve found over the years that

there are often many ways to solve problems successfully, and the approach you

take depends on many factors, including your personal style in solving prob-

lems and the desires of the client for how to solve the problems. However, even

more often than this, I’ve found that the biggest gains in successful modeling

come more from understanding data than from applying a more sophisticated

algorithm.

How This Book Is Organized

The book is organized around the Cross-Industry Standard Process Model for

Data Mining (CRISP-DM). While the name uses the term “data mining,” the

steps are the same in predictive analytics or any project that includes the build-

ing of predictive or statistical models.

This isn’t the only framework you can use, but it a well-documented frame-

work that has been around for more than 15 years. CRISP-DM should not be

used as a recipe with checkboxes, but rather as a guide during your predictive

modeling projects.

The six phases or steps in CRISP-DM:

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

After an introductory chapter, Business Understanding, Data Understanding,

and Data Preparation receive a chapter each in Chapters 2, 3, and 4. The Data

Understanding and Data Preparation chapters represent more than one-quarter

of the pages of technical content of the book, with Data Preparation the largest

single chapter. This is appropriate because preparing data for predictive model-

ing is nearly always considered the most time-consuming stage in a predictive

modeling project.

Introduction xxiii

fl ast.indd 01:56:26:PM 03/28/2014 Page xxiii

Chapter 5 describes association rules, usually considered a modeling method,

but one that differs enough from other descriptive and predictive modeling

techniques that it warrants a separate chapter.

The modeling methods—descriptive, predictive, and model ensembles—are

described in Chapters 6, 8, and 10. Sandwiched in between these are the model-

ing evaluation or assessment methods for descriptive and predictive modeling,

in Chapters 7 and 9, respectively. This sequence is intended to connect more

directly the modeling methods with how they are evaluated.

Chapter 11 takes a side-turn into text mining, a method of analysis of unstruc-

tured data that is becoming more mainstream in predictive modeling. Software

packages are increasingly including text mining algorithms built into the soft-

ware or as an add-on package. Chapter 12 describes the sixth phase of predictive

modeling: Model Deployment.

Finally, Chapter 13 contains two case studies: one based on work I did as a

contractor to Seer Analytics for the YMCA, and the second for a Fortune 500

company. The YMCA case study is actually two case studies built into a single

narrative because we took vastly different approaches to solve the problem.

The case studies are written from the perspective of an analyst as he considers

different ways to solve the problem, including solutions that were good ideas

but didn’t work well.

Throughout the book, I visit many of the problems in predictive modeling

that I have encountered in projects over the past 27 years, but the list certainly is

not exhaustive. Even with the inherent limitations of book format, the thought

process and principles are more important than developing a complete list of

possible approaches to solving problems.

Who Should Read This Book

This book is intended to be read by anyone currently in the fi eld of predictive

analytics or its related fi elds, including data mining, statistics, machine learn-

ing, data science, and business analytics.

For those who are just starting in the fi eld, the book will provide a description

of the core principles every modeler needs to know.

The book will also help those who wish to enter these fi elds but aren’t yet there.

It does not require a mathematics background beyond pre-calculus to under-

stand predictive analytics methods, though knowledge of calculus and linear

algebra will certainly help. The same applies to statistics. One can understand

the material in this book without a background in statistics; I, for one, have never

taken a course in statistics. However, understanding statistics certainly provides

additional intuition and insight into the approaches described in this book.

xxiv Introduction

fl ast.indd 01:56:26:PM 03/28/2014 Page xxiv

Tools You Will Need

No predictive modeling software is needed to understand the concepts in the

book, and no examples are shown that require a particular predictive analytics

software package. This was deliberate because the principles and techniques

described in this book are general, and apply to any software package. When

applicable, I describe which principles are usually available in software and

which are rarely available in software.

What’s on the Website

All data sets used in examples throughout the textbook are included on the

Wiley website at www.wiley.com/go/appliedpredictiveanalytics and in the

Resources section of www.abbottanalytics.com. These data sets include:

■ KDD Cup 1998 data, simplifi ed from the full version

■ Nasadata data set

■ Iris data set

In addition, baseline scripts, workfl ows, streams, or other documents specifi c

to predictive analytics software will be included as they become available.

These will provide baseline processing steps to replicate analyses from the

book chapters.

Summary

My hope is that this book will encourage those who want to be more effective

predictive modelers to continue to work at their craft. I’ve worked side-by-side

with dozens of predictive modelers over the years.

I always love watching how effective predictive modelers go about solving

problems, perhaps in ways I had never considered before. For those with more

experience, my hope is that this book will describe data preparation, modeling,

evaluation, and deployment in a way that even experienced modelers haven’t

thought of before.

1

c01.indd 01:52:36:PM 03/28/2014 Page 1

A small direct response company had developed dozens of programs in coop-

eration with major brands to sell books and DVDs. These affi nity programs

were very successful, but required considerable up-front work to develop the

creative content and determine which customers, already engaged with the brand,

were worth the signifi cant marketing spend to purchase the books or DVDs

on subscription. Typically, they fi rst developed test mailings on a moderately

sized sample to determine if the expected response rates were high enough to

justify a larger program.

One analyst with the company identifi ed a way to help the company become

more profi table. What if one could identify the key characteristics of those who

responded to the test mailing? Furthermore, what if one could generate a score

for these customers and determine what minimum score would result in a high

enough response rate to make the campaign profi table? The analyst discovered

predictive analytics techniques that could be used for both purposes, fi nding

key customer characteristics and using those characteristics to generate a score

that could be used to determine which customers to mail.

Two decades before, the owner of a small company in Virginia had a com-

pelling idea: Improve the accuracy and fl exibility of guided munitions using

optimal control. The owner and president, Roger Barron, began the process of

deriving the complex mathematics behind optimal control using a technique

known as variational calculus and hired a graduate student to assist him in the

task. Programmers then implemented the mathematics in computer code so

C H A P T E R

1

Overview of Predictive Analytics

2 Chapter 1 ■ Overview of Predictive Analytics

c01.indd 01:52:36:PM 03/28/2014 Page 2

they could simulate thousands of scenarios. For each trajectory, the variational

calculus minimized the miss distance while maximizing speed at impact as

well as the angle of impact.

The variational calculus algorithm succeeded in identifying the optimal

sequence of commands: how much the fi ns (control surfaces) needed to change

the path of the munition to follow the optimal path to the target. The concept

worked in simulation in the thousands of optimal trajectories that were run.

Moreover, the mathematics worked on several munitions, one of which was

the MK82 glide bomb, fi tted (in simulation) with an inertial guidance unit to

control the fi ns: an early smart-bomb.

There was a problem, however. The variational calculus was so computation-

ally complex that the small computers on-board could not solve the problem

in real time. But what if one could estimate the optimal guidance commands at

any time during the fl ight from observable characteristics of the fl ight? After

all, the guidance unit can compute where the bomb is in space, how fast it is

going, and the distance of the target that was programmed into the unit when it

was launched. If the estimates of the optimum guidance commands were close

enough to the actual optimal path, it would be near optimal and still succeed.

Predictive models were built to do exactly this. The system was called Optimal Path-to-Go guidance.

These two programs designed by two different companies seemingly could

not be more different. One program knows characteristics of people, such as

demographics and their level of engagement with a brand, and tries to predict

a human decision. The second program knows locations of a bomb in space and

tries to predict the best physical action for it to hit a target.

But they share something in common: They both need to estimate values that

are unknown but tremendously useful. For the affi nity programs, the models

estimate whether or not an individual will respond to a campaign, and for the

guidance program, the models estimate the best guidance command. In this

sense, these two programs are very similar because they both involve predict-

ing a value or values that are known historically, but are unknown at the time

a decision is needed. Not only are these programs related in this sense, but they

are far from unique; there are countless decisions businesses and government

agencies make every day that can be improved by using historic data as an aid

to making decisions or even to automate the decisions themselves.

This book describes the back-story behind how analysts build the predictive

models like the ones described in these two programs. There is science behind

much of what predictive modelers do, yet there is also plenty of art, where no

theory can inform us as to the best action, but experience provides principles

by which tradeoffs can be made as solutions are found. Without the art, the sci-

ence would only be able to solve a small subset of problems we face. Without

Chapter 1 ■ Overview of Predictive Analytics 3

c01.indd 01:52:36:PM 03/28/2014 Page 3

the science, we would be like a plane without a rudder or a kite without a tail,

moving at a rapid pace without any control, unable to achieve our objectives.

What Is Analytics?

Analytics is the process of using computational methods to discover and report

infl uential patterns in data. The goal of analytics is to gain insight and often

to affect decisions. Data is necessarily a measure of historic information so, by

defi nition, analytics examines historic data. The term itself rose to prominence

in 2005, in large part due to the introduction of Google Analytics. Nevertheless,

the ideas behind analytics are not new at all but have been represented by dif-

ferent terms throughout the decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge discovery, data mining, and now

even data science.The rise of analytics in recent years is pragmatic: As organizations collect more

data and begin to summarize it, there is a natural progression toward using

the data to improve estimates, forecasts, decisions, and ultimately, effi ciency.

What Is Predictive Analytics?

Predictive analytics is the process of discovering interesting and meaningful

patterns in data. It draws from several related disciplines, some of which have

been used to discover patterns in data for more than 100 years, including pat-

tern recognition, statistics, machine learning, artifi cial intelligence, and data

mining. What differentiates predictive analytics from other types of analytics?

First, predictive analytics is data-driven, meaning that algorithms derive key

characteristic of the models from the data itself rather than from assumptions

made by the analyst. Put another way, data-driven algorithms induce models

from the data. The induction process can include identifi cation of variables to

be included in the model, parameters that defi ne the model, weights or coef-

fi cients in the model, or model complexity.

Second, predictive analytics algorithms automate the process of fi nding the

patterns from the data. Powerful induction algorithms not only discover coef-

fi cients or weights for the models, but also the very form of the models. Decision

trees algorithms, for example, learn which of the candidate inputs best predict

a target variable in addition to identifying which values of the variables to use

in building predictions. Other algorithms can be modifi ed to perform searches,

using exhaustive or greedy searches to fi nd the best set of inputs and model

parameters. If the variable helps reduce model error, the variable is included

4 Chapter 1 ■ Overview of Predictive Analytics

c01.indd 01:52:36:PM 03/28/2014 Page 4

in the model. Otherwise, if the variable does not help to reduce model error, it

is eliminated.

Another automation task available in many software packages and algorithms

automates the process of transforming input variables so that they can be used

effectively in the predictive models. For example, if there are a hundred variables

that are candidate inputs to models that can be or should be transformed to

remove skew, you can do this with some predictive analytics software in a single

step rather than programming all one hundred transformations one at a time.

Predictive analytics doesn’t do anything that any analyst couldn’t accomplish

with pencil and paper or a spreadsheet if given enough time; the algorithms,

while powerful, have no common sense. Consider a supervised learning

data set with 50 inputs and a single binary target variable with values 0 and

1. One way to try to identify which of the inputs is most related to the target

variable is to plot each variable, one at a time, in a histogram. The target vari-

able can be superimposed on the histogram, as shown in Figure 1-1. With 50

inputs, you need to look at 50 histograms. This is not uncommon for predic-

tive modelers to do.

If the patterns require examining two variables at a time, you can do so with

a scatter plot. For 50 variables, there are 1,225 possible scatter plots to examine.

A dedicated predictive modeler might actually do this, although it will take

some time. However, if the patterns require that you examine three variables

simultaneously, you would need to examine 19,600 3D scatter plots in order to

examine all the possible three-way combinations. Even the most dedicated mod-

elers will be hard-pressed to spend the time needed to examine so many plots.

0[−0.926– −0.658] [−0.122–0.148] [0.688–0.951] [1.498–1.768]

102

204

306

408

510

612

714

816

918

1020

1129

Figure 1-1: Histogram

You need algorithms to sift through all of the potential combinations of inputs

in the data—the patterns—and identify which ones are the most interesting. The

analyst can then focus on these patterns, undoubtedly a much smaller number

of inputs to examine. Of the 19,600 three-way combinations of inputs, it may