10
DATA MINING AND BUSINESS ANALYTICS WITH R

Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

Embed Size (px)

Citation preview

Page 1: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

DATA MINING ANDBUSINESS ANALYTICS WITH R

Page 2: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

DATA MINING ANDBUSINESS ANALYTICSWITH R

Johannes LedolterDepartment of Management SciencesTippie College of BusinessUniversity of IowaIowa City, Iowa

Page 3: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

Copyright 2013 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New JerseyPublished simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any formor by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except aspermitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the priorwritten permission of the Publisher, or authorization through payment of the appropriate per-copy feeto the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400,fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permissionshould be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street,Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online athttp://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best effortsin preparing this book, they make no representations or warranties with respect to the accuracy orcompleteness of the contents of this book and specifically disclaim any implied warranties ofmerchantability or fitness for a particular purpose. No warranty may be created or extended by salesrepresentatives or written sales materials. The advice and strategies contained herein may not besuitable for your situation. You should consult with a professional where appropriate. Neither thepublisher nor author shall be liable for any loss of profit or any other commercial damages, includingbut not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact ourCustomer Care Department within the United States at (800) 762-2974, outside the United States at(317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in printmay not be available in electronic formats. For more information about Wiley products, visit our website at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Ledolter, Johannes.Data mining and business analytics with R / Johannes Ledolter, University of Iowa.

pages cmIncludes bibliographical references and index.ISBN 978-1-118-44714-7 (cloth)

1. Data mining. 2. R (Computer program language) 3. Commercial statistics. I. Title.QA76.9.D343L44 2013006.3′12–dc23

2013000330

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Page 4: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

CONTENTS

Preface ixAcknowledgments xi

1. Introduction 1

Reference 6

2. Processing the Information and Getting to Know Your Data 7

2.1 Example 1: 2006 Birth Data 72.2 Example 2: Alumni Donations 172.3 Example 3: Orange Juice 31

References 39

3. Standard Linear Regression 40

3.1 Estimation in R 433.2 Example 1: Fuel Efficiency of Automobiles 433.3 Example 2: Toyota Used-Car Prices 47Appendix 3.A The Effects of Model Overfitting on the Average

Mean Square Error of the Regression Prediction 53References 54

4. Local Polynomial Regression: a Nonparametric RegressionApproach 55

4.1 Model Selection 564.2 Application to Density Estimation and the Smoothing

of Histograms 584.3 Extension to the Multiple Regression Model 584.4 Examples and Software 58

References 65

5. Importance of Parsimony in Statistical Modeling 67

5.1 How Do We Guard Against False Discovery 67References 70

v

Page 5: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

vi CONTENTS

6. Penalty-Based Variable Selection in Regression Models withMany Parameters (LASSO) 71

6.1 Example 1: Prostate Cancer 746.2 Example 2: Orange Juice 78

References 82

7. Logistic Regression 83

7.1 Building a Linear Model for Binary Response Data 837.2 Interpretation of the Regression Coefficients in a Logistic

Regression Model 857.3 Statistical Inference 857.4 Classification of New Cases 867.5 Estimation in R 877.6 Example 1: Death Penalty Data 877.7 Example 2: Delayed Airplanes 927.8 Example 3: Loan Acceptance 1007.9 Example 4: German Credit Data 103

References 107

8. Binary Classification, Probabilities, and Evaluating ClassificationPerformance 108

8.1 Binary Classification 1088.2 Using Probabilities to Make Decisions 1088.3 Sensitivity and Specificity 1098.4 Example: German Credit Data 109

9. Classification Using a Nearest Neighbor Analysis 115

9.1 The k-Nearest Neighbor Algorithm 1169.2 Example 1: Forensic Glass 1179.3 Example 2: German Credit Data 122

Reference 125

10. The Naı̈ve Bayesian Analysis: a Model for Predictinga Categorical Response from Mostly CategoricalPredictor Variables 126

10.1 Example: Delayed Airplanes 127Reference 131

11. Multinomial Logistic Regression 132

11.1 Computer Software 13411.2 Example 1: Forensic Glass 134

Page 6: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

CONTENTS vii

11.3 Example 2: Forensic Glass Revisited 141Appendix 11.A Specification of a Simple Triplet Matrix 147

References 149

12. More on Classification and a Discussion on Discriminant Analysis 150

12.1 Fisher’s Linear Discriminant Function 15312.2 Example 1: German Credit Data 15412.3 Example 2: Fisher Iris Data 15612.4 Example 3: Forensic Glass Data 15712.5 Example 4: MBA Admission Data 159

Reference 160

13. Decision Trees 161

13.1 Example 1: Prostate Cancer 16713.2 Example 2: Motorcycle Acceleration 17913.3 Example 3: Fisher Iris Data Revisited 182

14. Further Discussion on Regression and Classification Trees,Computer Software, and Other Useful Classification Methods 185

14.1 R Packages for Tree Construction 18514.2 Chi-Square Automatic Interaction Detection (CHAID) 18614.3 Ensemble Methods: Bagging, Boosting, and Random

Forests 18814.4 Support Vector Machines (SVM) 19214.5 Neural Networks 19214.6 The R Package Rattle: A Useful Graphical User Interface

for Data Mining 193References 195

15. Clustering 196

15.1 k -Means Clustering 19615.2 Another Way to Look at Clustering: Applying the

Expectation-Maximization (EM) Algorithm to Mixturesof Normal Distributions 204

15.3 Hierarchical Clustering Procedures 212References 219

16. Market Basket Analysis: Association Rules and Lift 220

16.1 Example 1: Online Radio 22216.2 Example 2: Predicting Income 227

References 234

Page 7: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

viii CONTENTS

17. Dimension Reduction: Factor Models and Principal Components 235

17.1 Example 1: European Protein Consumption 23817.2 Example 2: Monthly US Unemployment Rates 243

18. Reducing the Dimension in Regressions with MulticollinearInputs: Principal Components Regression and Partial LeastSquares 247

18.1 Three Examples 249References 257

19. Text as Data: Text Mining and Sentiment Analysis 258

19.1 Inverse Multinomial Logistic Regression 25919.2 Example 1: Restaurant Reviews 26119.3 Example 2: Political Sentiment 266Appendix 19.A Relationship Between the Gentzkow Shapiro

Estimate of “Slant” and Partial Least Squares 268References 271

20. Network Data 272

20.1 Example 1: Marriage and Power in Fifteenth CenturyFlorence 274

20.2 Example 2: Connections in a Friendship Network 278References 292

Appendix A: Exercises 293

Exercise 1 294Exercise 2 294Exercise 3 296Exercise 4 298Exercise 5 299Exercise 6 300Exercise 7 301

Appendix B: References 338

Index 341

Page 8: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

PREFACE

This book is about useful methods for data mining and business analytics. It iswritten for readers who want to apply these methods so that they can learn abouttheir processes and solve their problems. My objective is to provide a thoroughdiscussion of the most useful data-mining tools that goes beyond the typical “blackbox” description, and to show why these tools work.

Powerful, accurate, and flexible computing software is needed for data mining,and Excel is of little use. Although excellent data-mining software is offered byvarious commercial vendors, proprietary products are usually expensive. In thistext, I use the R Statistical Software, which is powerful and free. But the use ofR comes with start-up costs. R requires the user to write out instructions, and thewriting of program instructions will be unfamiliar to most spreadsheet users. This iswhy I provide R sample programs in the text and on the webpage that is associatedwith this book. These sample programs should smooth the transition to this verygeneral and powerful computer environment and help keep the start-up costs tousing R small.

The text combines explanations of the statistical foundation of data mining withuseful software so that the tools can be readily applied and put to use. There arecertainly better books that give a deeper description of the methods, and there arealso numerous texts that give a more complete guide to computing with R. Thisbook tries to strike a compromise that does justice to both theory and practice,at a level that can be understood by the MBA student interested in quantitativemethods. This book can be used in courses on data mining in quantitative MBAprograms and in upper-level undergraduate and graduate programs that deal withthe analysis and interpretation of large data sets. Students in business, the socialand natural sciences, medicine, and engineering should benefit from this book.The majority of the topics can be covered in a one semester course. But not everycovered topic will be useful for all audiences, and for some audiences, the coverageof certain topics will be either too advanced or too basic. By omitting some topicsand by expanding on others, one can make this book work for many differentaudiences.

Certain data-mining applications require an enormous amount of effort to justcollect the relevant information, and in such cases, the data preparation takes alot more time than the eventual modeling. In other applications, the data collectioneffort is minimal, but often one has to worry about the efficient storage and retrievalof high volume information (i.e., the “data warehousing”). Although it is veryimportant to know how to acquire, store, merge, and best arrange the information,

ix

Page 9: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

x PREFACE

this text does not cover these aspects very deeply. This book concentrates on themodeling aspects of data mining.

The data sets and the R-code for all examples can be found on the webpage thataccompanies this book (http://www.biz.uiowa.edu/faculty/jledolter/DataMining).Supplementary material for this book can also be found by entering ISBN9781118447147 at booksupport.wiley.com. You can copy and paste the code intoyour own R session and rerun all analyses. You can experiment with the softwareby making changes and additions, and you can adapt the R templates to theanalysis of your own data sets. Exercises and several large practice data sets aregiven at the end of this book. The exercises will help instructors when assigninghomework problems, and they will give the reader the opportunity to practice thetechniques that are discussed in this book. Instructions on how to best use thesedata sets are given in Appendix A.

This is a first edition. Although I have tried to be very careful in my writing andin the analyses of the illustrative data sets, I am certain that much can be improved.I would very much appreciate any feedback you may have, and I encourage youto write to me at [email protected]. Corrections and comments will beposted on the book’s webpage.

Page 10: Data Mining and Business Analytics with R (Ledolter/Data Mining and Business Analytics with R) || Frontmatter

ACKNOWLEDGMENTS

I got interested in developing materials for an MBA-level text on Data Miningwhen I visited the University of Chicago Booth School of Business in 2011. Theoutstanding University of Chicago lecture materials for the course on Data Min-ing (BUS41201) taught by Professor Matt Taddy provided the spark to put thistext together, and several examples and R-templates from Professor Taddy’s noteshave influenced my presentation. Chapter 19 on the analysis of text data drawsheavily on his recent research. Professor Taddy’s contributions are most gratefullyacknowledged.

Writing a text is a time-consuming task. I could not have done this withoutthe support and constant encouragement of my wife, Lea Vandervelde. Lea, a lawprofessor at the University of Iowa, conducts historical research on the freedomsuits of Missouri slaves. She knows first-hand how important and difficult it is toconstruct data sets for the mining of text data.

xi