The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition)
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Springer Series in Statistics


2. This is page vPrinter: Opaque thisTo our parents:Valerie and Patrick HastieVera and Sami TibshiraniFlorence and Harry Friedmanand to our families:Samantha, Timothy, and LyndaCharlie, Ryan, Julie, and CherylMelanie, Dora, Monika, and Ildiko 3. vi 4. This is page viiPrinter: Opaque thisPreface to the Second EditionIn God we trust, all others bring data.William Edwards Deming (1900-1993)1We have been gratified by the popularity of the first edition of TheElements of Statistical Learning. This, along with the fast pace of researchin the statistical learning field, motivated us to update our book with asecond edition.We have added four new chapters and updated some of the existingchapters. Because many readers are familiar with the layout of the firstedition, we have tried to change it as little as possible. Here is a summaryof the main changes:1On the Web, this quote has been widely attributed to both Deming and Robert W.Hayden; however Professor Hayden told us that he can claim no credit for this quote,and ironically we could find no data confirming that Deming actually said this. 5. viii Preface to the Second EditionChapter Whats new1. Introduction2. Overview of Supervised Learning3. Linear Methods for Regression LAR algorithm and generalizationsof the lasso4. Linear Methods for Classification Lasso path for logistic regression5. Basis Expansions and Regulariza-tionAdditional illustrations of RKHS6. Kernel Smoothing Methods7. Model Assessment and Selection Strengths and pitfalls of cross-validation8. Model Inference and Averaging9. Additive Models, Trees, andRelated Methods10. Boosting and Additive Trees New example from ecology; somematerial split off to Chapter 16.11. Neural Networks Bayesian neural nets and the NIPS2003 challenge12. Support Vector Machines andFlexible DiscriminantsPath algorithm for SVM classifier13. Prototype Methods andNearest-Neighbors14. Unsupervised Learning Spectral clustering, kernel PCA,sparse PCA, non-negative matrixfactorization archetypal analysis,nonlinear dimension reduction,Google page rank algorithm, adirect approach to ICA15. Random Forests New16. Ensemble Learning New17. Undirected Graphical Models New18. High-Dimensional Problems NewSome further notes: Our first edition was unfriendly to colorblind readers; in particular,we tended to favor red/green contrasts which are particularly trou-blesome.We have changed the color palette in this edition to a largeextent, replacing the above with an orange/blue contrast. We have changed the name of Chapter 6 from Kernel Methods toKernel Smoothing Methods, to avoid confusion with the machine-learningkernel method that is discussed in the context of support vec-tormachines (Chapter 11) and more generally in Chapters 5 and 14. In the first edition, the discussion of error-rate estimation in Chap-ter7 was sloppy, as we did not clearly differentiate the notions ofconditional error rates (conditional on the training set) and uncondi-tionalrates. We have fixed this in the new edition. 6. Preface to the Second Edition ix Chapters 15 and 16 follow naturally from Chapter 10, and the chap-tersare probably best read in that order. In Chapter 17, we have not attempted a comprehensive treatmentof graphical models, and discuss only undirected models and somenew methods for their estimation. Due to a lack of space, we havespecifically omitted coverage of directed graphical models. Chapter 18 explores the pN problem, which is learning in high-dimensionalfeature spaces. 
These problems arise in many areas, in-cludinggenomic and proteomic studies, and document classification.We thank the many readers who have found the (too numerous) errors inthe first edition. We apologize for those and have done our best to avoid er-rorsin this new edition.We thank Mark Segal, Bala Rajaratnam, and LarryWasserman for comments on some of the new chapters, and many Stanfordgraduate and post-doctoral students who offered comments, in particularMohammed AlQuraishi, John Boik, Holger Hoefling, Arian Maleki, DonalMcMahon, Saharon Rosset, Babak Shababa, Daniela Witten, Ji Zhu andHui Zou.We thank John Kimmel for his patience in guiding us through thisnew edition. RT dedicates this edition to the memory of Anna McPhee.Trevor HastieRobert TibshiraniJerome FriedmanStanford, CaliforniaAugust 2008 7. x Preface to the Second Edition 8. This is page xiPrinter: Opaque thisPreface to the First EditionWe are drowning in information and starving for knowledge.Rutherford D. RogerThe field of Statistics is constantly challenged by the problems that scienceand industry brings to its door. In the early days, these problems often camefrom agricultural and industrial experiments and were relatively small inscope. With the advent of computers and the information age, statisticalproblems have exploded both in size and complexity. Challenges in theareas of data storage, organization and searching have led to the new fieldof data mining; statistical and computational problems in biology andmedicine have created bioinformatics. Vast amounts of data are beinggenerated in many fields, and the statisticians job is to make sense of itall: to extract important patterns and trends, and understand what thedata says. We call this learning from data.The challenges in learning from data have led to a revolution in the sta-tisticalsciences. Since computation plays such a key role, it is not surprisingthat much of this new development has been done by researchers in otherfields such as computer science and engineering.The learning problems that we consider can be roughly categorized aseither supervised or unsupervised. In supervised learning, the goal is to pre-dictthe value of an outcome measure based on a number of input measures;in unsupervised learning, there is no outcome measure, and the goal is todescribe the associations and patterns among a set of input measures. 9. xii Preface to the First EditionThis book is our attempt to bring together many of the important newideas in learning, and explain them in a statistical framework. While somemathematical details are needed, we emphasize the methods and their con-ceptualunderpinnings rather than their theoretical properties. As a result,we hope that this book will appeal not just to statisticians but also toresearchers and practitioners in a wide variety of fields.Just as we have learned a great deal from researchers outside of the fieldof statistics, our statistical viewpoint may help others to better understanddifferent aspects of learning:There is no true interpretation of anything; interpretation is avehicle in the service of human comprehension. The value ofinterpretation is in enabling others to fruitfully think about anidea.Andreas BujaWe would like to acknowledge the contribution of many people to theconception and completion of this book. David Andrews, Leo Breiman,Andreas Buja, John Chambers, Bradley Efron, Geoffrey Hinton, WernerStuetzle, and John Tukey have greatly influenced our careers. 
Balasub-ramanianNarasimhan gave us advice and help on many computationalproblems, and maintained an excellent computing environment. Shin-HoBang helped in the production of a number of the figures. Lee Wilkinsongave valuable tips on color production. Ilana Belitskaya, Eva Cantoni, MayaGupta, Michael Jordan, Shanti Gopatam, Radford Neal, Jorge Picazo, Bog-danPopescu, Olivier Renaud, Saharon Rosset, John Storey, Ji Zhu, MuZhu, two reviewers and many students read parts of the manuscript andoffered helpful suggestions. John Kimmel was supportive, patient and help-fulat every phase; MaryAnn Brickner and Frank Ganz headed a superbproduction team at Springer. Trevor Hastie would like to thank the statis-ticsdepartment at the University of Cape Town for their hospitality duringthe final stages of this book. We gratefully acknowledge NSF and NIH fortheir support of this work. Finally, we would like to thank our families andour parents for their love and support.Trevor HastieRobert TibshiraniJerome FriedmanStanford, CaliforniaMay 2001The quiet statisticians have changed our world; not by discov-eringnew facts or technical developments, but by changing theways that we reason, experiment and form our opinions ....Ian Hacking 10. This is page xiiiPrinter: Opaque thisContentsPreface to the Second Edition viiPreface to the First Edition xi1 Introduction 12 Overview of Supervised Learning 92.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Variable Types and Terminology . . . . . . . . . . . . . . 92.3 Two Simple Approaches to Prediction:Least Squares and Nearest Neighbors . . . . . . . . . . . 112.3.1 Linear Models and Least Squares . . . . . . . . 112.3.2 Nearest-Neighbor Methods . . . . . . . . . . . . 142.3.3 From Least Squares to Nearest Neighbors . . . . 162.4 Statistical Decision Theory . . . . . . . . . . . . . . . . . 182.5 Local Methods in High Dimensions . . . . . . . . . . . . . 222.6 Statistical Models, Supervised Learningand Function Approximation . . . . . . . . . . . . . . . . 282.6.1 A Statistical Modelfor the Joint Distribution Pr(X, Y ) . . . . . . . 282.6.2 Supervised Learning . . . . . . . . . . . . . . . . 292.6.3 Function Approximation . . . . . . . . . . . . . 292.7 Structured Regression Models . . . . . . . . . . . . . . . 322.7.1 Difficulty of the Problem . . . . . . . . . . . . . 32 11. xiv Contents2.8 Classes of Restricted Estimators . . . . . . . . . . . . . . 332.8.1 Roughness Penalty and Bayesian Methods . . . 342.8.2 Kernel Methods and Local Regression . . . . . . 342.8.3 Basis Functions and Dictionary Methods . . . . 352.9 Model Selection and the BiasVariance Tradeoff . . . . . 37Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 39Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 Linear Methods for Regression 433.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 433.2 Linear Regression Models and Least Squares . . . . . . . 443.2.1 Example: Prostate Cancer . . . . . . . . . . . . 493.2.2 The GaussMarkov Theorem . . . . . . . . . . . 513.2.3 Multiple Regressionfrom Simple Univariate Regression . . . . . . . . 523.2.4 Multiple Outputs . . . . . . . . . . . . . . . . . 563.3 Subset Selection . . . . . . . . . . . . . . . . . . . . . . . 573.3.1 Best-Subset Selection . . . . . . . . . . . . . . . 573.3.2 Forward- and Backward-Stepwise Selection . . . 583.3.3 Forward-Stagewise Regression . . . . . . . . . . 603.3.4 Prostate Cancer Data Example (Continued) . . 613.4 Shrinkage Methods . . . . . . . 
. . . . . . . . . . . . . . . 613.4.1 Ridge Regression . . . . . . . . . . . . . . . . . 613.4.2 The Lasso . . . . . . . . . . . . . . . . . . . . . 683.4.3 Discussion: Subset Selection, Ridge Regressionand the Lasso . . . . . . . . . . . . . . . . . . . 693.4.4 Least Angle Regression . . . . . . . . . . . . . . 733.5 Methods Using Derived Input Directions . . . . . . . . . 793.5.1 Principal Components Regression . . . . . . . . 793.5.2 Partial Least Squares . . . . . . . . . . . . . . . 803.6 Discussion: A Comparison of the Selectionand Shrinkage Methods . . . . . . . . . . . . . . . . . . . 823.7 Multiple Outcome Shrinkage and Selection . . . . . . . . 843.8 More on the Lasso and Related Path Algorithms . . . . . 863.8.1 Incremental Forward Stagewise Regression . . . 863.8.2 Piecewise-Linear Path Algorithms . . . . . . . . 893.8.3 The Dantzig Selector . . . . . . . . . . . . . . . 893.8.4 The Grouped Lasso . . . . . . . . . . . . . . . . 903.8.5 Further Properties of the Lasso . . . . . . . . . . 913.8.6 Pathwise Coordinate Optimization . . . . . . . . 923.9 Computational Considerations . . . . . . . . . . . . . . . 93Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 94Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 12. Contents xv4 Linear Methods for Classification 1014.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1014.2 Linear Regression of an Indicator Matrix . . . . . . . . . 1034.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . 1064.3.1 Regularized Discriminant Analysis . . . . . . . . 1124.3.2 Computations for LDA . . . . . . . . . . . . . . 1134.3.3 Reduced-Rank Linear Discriminant Analysis . . 1134.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 1194.4.1 Fitting Logistic Regression Models . . . . . . . . 1204.4.2 Example: South African Heart Disease . . . . . 1224.4.3 Quadratic Approximations and Inference . . . . 1244.4.4 L1 Regularized Logistic Regression . . . . . . . . 1254.4.5 Logistic Regression or LDA? . . . . . . . . . . . 1274.5 Separating Hyperplanes . . . . . . . . . . . . . . . . . . . 1294.5.1 Rosenblatts Perceptron Learning Algorithm . . 1304.5.2 Optimal Separating Hyperplanes . . . . . . . . . 132Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 135Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1355 Basis Expansions and Regularization 1395.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1395.2 Piecewise Polynomials and Splines . . . . . . . . . . . . . 1415.2.1 Natural Cubic Splines . . . . . . . . . . . . . . . 1445.2.2 Example: South African Heart Disease (Continued)1465.2.3 Example: Phoneme Recognition . . . . . . . . . 1485.3 Filtering and Feature Extraction . . . . . . . . . . . . . . 1505.4 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . 1515.4.1 Degrees of Freedom and Smoother Matrices . . . 1535.5 Automatic Selection of the Smoothing Parameters . . . . 1565.5.1 Fixing the Degrees of Freedom . . . . . . . . . . 1585.5.2 The BiasVariance Tradeoff . . . . . . . . . . . . 1585.6 Nonparametric Logistic Regression . . . . . . . . . . . . . 1615.7 Multidimensional Splines . . . . . . . . . . . . . . . . . . 1625.8 Regularization and Reproducing Kernel Hilbert Spaces . 1675.8.1 Spaces of Functions Generated by Kernels . . . 1685.8.2 Examples of RKHS . . . . . . . . . . . . . . . . 1705.9 Wavelet Smoothing . . . . . . . . . . . . . . . . . . . . . 
1745.9.1 Wavelet Bases and the Wavelet Transform . . . 1765.9.2 Adaptive Wavelet Filtering . . . . . . . . . . . . 179Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 181Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181Appendix: Computational Considerations for Splines . . . . . . 186Appendix: B-splines . . . . . . . . . . . . . . . . . . . . . 186Appendix: Computations for Smoothing Splines . . . . . 189 13. xvi Contents6 Kernel Smoothing Methods 1916.1 One-Dimensional Kernel Smoothers . . . . . . . . . . . . 1926.1.1 Local Linear Regression . . . . . . . . . . . . . . 1946.1.2 Local Polynomial Regression . . . . . . . . . . . 1976.2 Selecting the Width of the Kernel . . . . . . . . . . . . . 1986.3 Local Regression in IRp . . . . . . . . . . . . . . . . . . . 2006.4 Structured Local Regression Models in IRp . . . . . . . . 2016.4.1 Structured Kernels . . . . . . . . . . . . . . . . . 2036.4.2 Structured Regression Functions . . . . . . . . . 2036.5 Local Likelihood and Other Models . . . . . . . . . . . . 2056.6 Kernel Density Estimation and Classification . . . . . . . 2086.6.1 Kernel Density Estimation . . . . . . . . . . . . 2086.6.2 Kernel Density Classification . . . . . . . . . . . 2106.6.3 The Naive Bayes Classifier . . . . . . . . . . . . 2106.7 Radial Basis Functions and Kernels . . . . . . . . . . . . 2126.8 Mixture Models for Density Estimation and Classification 2146.9 Computational Considerations . . . . . . . . . . . . . . . 216Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 216Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2167 Model Assessment and Selection 2197.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2197.2 Bias, Variance and Model Complexity . . . . . . . . . . . 2197.3 The BiasVariance Decomposition . . . . . . . . . . . . . 2237.3.1 Example: BiasVariance Tradeoff . . . . . . . . 2267.4 Optimism of the Training Error Rate . . . . . . . . . . . 2287.5 Estimates of In-Sample Prediction Error . . . . . . . . . . 2307.6 The Effective Number of Parameters . . . . . . . . . . . . 2327.7 The Bayesian Approach and BIC . . . . . . . . . . . . . . 2337.8 Minimum Description Length . . . . . . . . . . . . . . . . 2357.9 VapnikChervonenkis Dimension . . . . . . . . . . . . . . 2377.9.1 Example (Continued) . . . . . . . . . . . . . . . 2397.10 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 2417.10.1 K-Fold Cross-Validation . . . . . . . . . . . . . 2417.10.2 The Wrong and Right Wayto Do Cross-validation . . . . . . . . . . . . . . . 2457.10.3 Does Cross-Validation Really Work? . . . . . . . 2477.11 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . 2497.11.1 Example (Continued) . . . . . . . . . . . . . . . 2527.12 Conditional or Expected Test Error? . . . . . . . . . . . . 254Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 257Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2578 Model Inference and Averaging 2618.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 261 14. Contents xvii8.2 The Bootstrap and Maximum Likelihood Methods . . . . 2618.2.1 A Smoothing Example . . . . . . . . . . . . . . 2618.2.2 Maximum Likelihood Inference . . . . . . . . . . 2658.2.3 Bootstrap versus Maximum Likelihood . . . . . 2678.3 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . 2678.4 Relationship Between the Bootstrapand Bayesian Inference . . . . . . . . . . . . . . . . . . . 
2718.5 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . 2728.5.1 Two-Component Mixture Model . . . . . . . . . 2728.5.2 The EM Algorithm in General . . . . . . . . . . 2768.5.3 EM as a MaximizationMaximization Procedure 2778.6 MCMC for Sampling from the Posterior . . . . . . . . . . 2798.7 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2828.7.1 Example: Trees with Simulated Data . . . . . . 2838.8 Model Averaging and Stacking . . . . . . . . . . . . . . . 2888.9 Stochastic Search: Bumping . . . . . . . . . . . . . . . . . 290Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 292Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2939 Additive Models, Trees, and Related Methods 2959.1 Generalized Additive Models . . . . . . . . . . . . . . . . 2959.1.1 Fitting Additive Models . . . . . . . . . . . . . . 2979.1.2 Example: Additive Logistic Regression . . . . . 2999.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . 3049.2 Tree-Based Methods . . . . . . . . . . . . . . . . . . . . . 3059.2.1 Background . . . . . . . . . . . . . . . . . . . . 3059.2.2 Regression Trees . . . . . . . . . . . . . . . . . . 3079.2.3 Classification Trees . . . . . . . . . . . . . . . . 3089.2.4 Other Issues . . . . . . . . . . . . . . . . . . . . 3109.2.5 Spam Example (Continued) . . . . . . . . . . . 3139.3 PRIM: Bump Hunting . . . . . . . . . . . . . . . . . . . . 3179.3.1 Spam Example (Continued) . . . . . . . . . . . 3209.4 MARS: Multivariate Adaptive Regression Splines . . . . . 3219.4.1 Spam Example (Continued) . . . . . . . . . . . 3269.4.2 Example (Simulated Data) . . . . . . . . . . . . 3279.4.3 Other Issues . . . . . . . . . . . . . . . . . . . . 3289.5 Hierarchical Mixtures of Experts . . . . . . . . . . . . . . 3299.6 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 3329.7 Computational Considerations . . . . . . . . . . . . . . . 334Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 334Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33510 Boosting and Additive Trees 33710.1 Boosting Methods . . . . . . . . . . . . . . . . . . . . . . 33710.1.1 Outline of This Chapter . . . . . . . . . . . . . . 340 15. xviii Contents10.2 Boosting Fits an Additive Model . . . . . . . . . . . . . . 34110.3 Forward Stagewise Additive Modeling . . . . . . . . . . . 34210.4 Exponential Loss and AdaBoost . . . . . . . . . . . . . . 34310.5 Why Exponential Loss? . . . . . . . . . . . . . . . . . . . 34510.6 Loss Functions and Robustness . . . . . . . . . . . . . . . 34610.7 Off-the-Shelf Procedures for Data Mining . . . . . . . . 35010.8 Example: Spam Data . . . . . . . . . . . . . . . . . . . . 35210.9 Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . 35310.10 Numerical Optimization via Gradient Boosting . . . . . . 35810.10.1 Steepest Descent . . . . . . . . . . . . . . . . . . 35810.10.2 Gradient Boosting . . . . . . . . . . . . . . . . . 35910.10.3 Implementations of Gradient Boosting . . . . . . 36010.11 Right-Sized Trees for Boosting . . . . . . . . . . . . . . . 36110.12 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 36410.12.1 Shrinkage . . . . . . . . . . . . . . . . . . . . . . 36410.12.2 Subsampling . . . . . . . . . . . . . . . . . . . . 36510.13 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 36710.13.1 Relative Importance of Predictor Variables . . . 36710.13.2 Partial Dependence Plots . . . . . . . . . . . . . 36910.14 Illustrations . . . . . 
. . . . . . . . . . . . . . . . . . . . . 37110.14.1 California Housing . . . . . . . . . . . . . . . . . 37110.14.2 New Zealand Fish . . . . . . . . . . . . . . . . . 37510.14.3 Demographics Data . . . . . . . . . . . . . . . . 379Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 380Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38411 Neural Networks 38911.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 38911.2 Projection Pursuit Regression . . . . . . . . . . . . . . . 38911.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 39211.4 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . 39511.5 Some Issues in Training Neural Networks . . . . . . . . . 39711.5.1 Starting Values . . . . . . . . . . . . . . . . . . . 39711.5.2 Overfitting . . . . . . . . . . . . . . . . . . . . . 39811.5.3 Scaling of the Inputs . . . . . . . . . . . . . . . 39811.5.4 Number of Hidden Units and Layers . . . . . . . 40011.5.5 Multiple Minima . . . . . . . . . . . . . . . . . . 40011.6 Example: Simulated Data . . . . . . . . . . . . . . . . . . 40111.7 Example: ZIP Code Data . . . . . . . . . . . . . . . . . . 40411.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 40811.9 Bayesian Neural Nets and the NIPS 2003 Challenge . . . 40911.9.1 Bayes, Boosting and Bagging . . . . . . . . . . . 41011.9.2 Performance Comparisons . . . . . . . . . . . . 41211.10 Computational Considerations . . . . . . . . . . . . . . . 414Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 415 16. Contents xixExercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41512 Support Vector Machines andFlexible Discriminants 41712.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 41712.2 The Support Vector Classifier . . . . . . . . . . . . . . . . 41712.2.1 Computing the Support Vector Classifier . . . . 42012.2.2 Mixture Example (Continued) . . . . . . . . . . 42112.3 Support Vector Machines and Kernels . . . . . . . . . . . 42312.3.1 Computing the SVM for Classification . . . . . . 42312.3.2 The SVM as a Penalization Method . . . . . . . 42612.3.3 Function Estimation and Reproducing Kernels . 42812.3.4 SVMs and the Curse of Dimensionality . . . . . 43112.3.5 A Path Algorithm for the SVM Classifier . . . . 43212.3.6 Support Vector Machines for Regression . . . . . 43412.3.7 Regression and Kernels . . . . . . . . . . . . . . 43612.3.8 Discussion . . . . . . . . . . . . . . . . . . . . . 43812.4 Generalizing Linear Discriminant Analysis . . . . . . . . 43812.5 Flexible Discriminant Analysis . . . . . . . . . . . . . . . 44012.5.1 Computing the FDA Estimates . . . . . . . . . . 44412.6 Penalized Discriminant Analysis . . . . . . . . . . . . . . 44612.7 Mixture Discriminant Analysis . . . . . . . . . . . . . . . 44912.7.1 Example: Waveform Data . . . . . . . . . . . . . 451Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 455Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45513 Prototype Methods and Nearest-Neighbors 45913.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 45913.2 Prototype Methods . . . . . . . . . . . . . . . . . . . . . 45913.2.1 K-means Clustering . . . . . . . . . . . . . . . . 46013.2.2 Learning Vector Quantization . . . . . . . . . . 46213.2.3 Gaussian Mixtures . . . . . . . . . . . . . . . . . 46313.3 k-Nearest-Neighbor Classifiers . . . . . . . . . . . . . . . 46313.3.1 Example: A Comparative Study . . . . . . . . . 
46813.3.2 Example: k-Nearest-Neighborsand Image Scene Classification . . . . . . . . . . 47013.3.3 Invariant Metrics and Tangent Distance . . . . . 47113.4 Adaptive Nearest-Neighbor Methods . . . . . . . . . . . . 47513.4.1 Example . . . . . . . . . . . . . . . . . . . . . . 47813.4.2 Global Dimension Reductionfor Nearest-Neighbors . . . . . . . . . . . . . . . 47913.5 Computational Considerations . . . . . . . . . . . . . . . 480Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 481Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481 17. xx Contents14 Unsupervised Learning 48514.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 48514.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . 48714.2.1 Market Basket Analysis . . . . . . . . . . . . . . 48814.2.2 The Apriori Algorithm . . . . . . . . . . . . . . 48914.2.3 Example: Market Basket Analysis . . . . . . . . 49214.2.4 Unsupervised as Supervised Learning . . . . . . 49514.2.5 Generalized Association Rules . . . . . . . . . . 49714.2.6 Choice of Supervised Learning Method . . . . . 49914.2.7 Example: Market Basket Analysis (Continued) . 49914.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . 50114.3.1 Proximity Matrices . . . . . . . . . . . . . . . . 50314.3.2 Dissimilarities Based on Attributes . . . . . . . 50314.3.3 Object Dissimilarity . . . . . . . . . . . . . . . . 50514.3.4 Clustering Algorithms . . . . . . . . . . . . . . . 50714.3.5 Combinatorial Algorithms . . . . . . . . . . . . 50714.3.6 K-means . . . . . . . . . . . . . . . . . . . . . . 50914.3.7 Gaussian Mixtures as Soft K-means Clustering . 51014.3.8 Example: Human Tumor Microarray Data . . . 51214.3.9 Vector Quantization . . . . . . . . . . . . . . . . 51414.3.10 K-medoids . . . . . . . . . . . . . . . . . . . . . 51514.3.11 Practical Issues . . . . . . . . . . . . . . . . . . 51814.3.12 Hierarchical Clustering . . . . . . . . . . . . . . 52014.4 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . 52814.5 Principal Components, Curves and Surfaces . . . . . . . . 53414.5.1 Principal Components . . . . . . . . . . . . . . . 53414.5.2 Principal Curves and Surfaces . . . . . . . . . . 54114.5.3 Spectral Clustering . . . . . . . . . . . . . . . . 54414.5.4 Kernel Principal Components . . . . . . . . . . . 54714.5.5 Sparse Principal Components . . . . . . . . . . . 55014.6 Non-negative Matrix Factorization . . . . . . . . . . . . . 55314.6.1 Archetypal Analysis . . . . . . . . . . . . . . . . 55414.7 Independent Component Analysisand Exploratory Projection Pursuit . . . . . . . . . . . . 55714.7.1 Latent Variables and Factor Analysis . . . . . . 55814.7.2 Independent Component Analysis . . . . . . . . 56014.7.3 Exploratory Projection Pursuit . . . . . . . . . . 56514.7.4 A Direct Approach to ICA . . . . . . . . . . . . 56514.8 Multidimensional Scaling . . . . . . . . . . . . . . . . . . 57014.9 Nonlinear Dimension Reductionand Local Multidimensional Scaling . . . . . . . . . . . . 57214.10 The Google PageRank Algorithm . . . . . . . . . . . . . 576Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 578Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579 18. Contents xxi15 Random Forests 58715.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 58715.2 Definition of Random Forests . . . . . . . . . . . . . . . . 58715.3 Details of Random Forests . . . . . . . . . . . . . . . . . 59215.3.1 Out of Bag Samples . . . . . . . . . . . . . 
. . . 59215.3.2 Variable Importance . . . . . . . . . . . . . . . . 59315.3.3 Proximity Plots . . . . . . . . . . . . . . . . . . 59515.3.4 Random Forests and Overfitting . . . . . . . . . 59615.4 Analysis of Random Forests . . . . . . . . . . . . . . . . . 59715.4.1 Variance and the De-Correlation Effect . . . . . 59715.4.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . 60015.4.3 Adaptive Nearest Neighbors . . . . . . . . . . . 601Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 602Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60316 Ensemble Learning 60516.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 60516.2 Boosting and Regularization Paths . . . . . . . . . . . . . 60716.2.1 Penalized Regression . . . . . . . . . . . . . . . 60716.2.2 The Bet on Sparsity Principle . . . . . . . . . 61016.2.3 Regularization Paths, Over-fitting and Margins . 61316.3 Learning Ensembles . . . . . . . . . . . . . . . . . . . . . 61616.3.1 Learning a Good Ensemble . . . . . . . . . . . . 61716.3.2 Rule Ensembles . . . . . . . . . . . . . . . . . . 622Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 623Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62417 Undirected Graphical Models 62517.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 62517.2 Markov Graphs and Their Properties . . . . . . . . . . . 62717.3 Undirected Graphical Models for Continuous Variables . 63017.3.1 Estimation of the Parameterswhen the Graph Structure is Known . . . . . . . 63117.3.2 Estimation of the Graph Structure . . . . . . . . 63517.4 Undirected Graphical Models for Discrete Variables . . . 63817.4.1 Estimation of the Parameterswhen the Graph Structure is Known . . . . . . . 63917.4.2 Hidden Nodes . . . . . . . . . . . . . . . . . . . 64117.4.3 Estimation of the Graph Structure . . . . . . . . 64217.4.4 Restricted Boltzmann Machines . . . . . . . . . 643Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64518 High-Dimensional Problems: pN 64918.1 When p is Much Bigger than N . . . . . . . . . . . . . . 649 19. xxii Contents18.2 Diagonal Linear Discriminant Analysisand Nearest Shrunken Centroids . . . . . . . . . . . . . . 65118.3 Linear Classifiers with Quadratic Regularization . . . . . 65418.3.1 Regularized Discriminant Analysis . . . . . . . . 65618.3.2 Logistic Regressionwith Quadratic Regularization . . . . . . . . . . 65718.3.3 The Support Vector Classifier . . . . . . . . . . 65718.3.4 Feature Selection . . . . . . . . . . . . . . . . . . 65818.3.5 Computational Shortcuts When pN . . . . . 65918.4 Linear Classifiers with L1 Regularization . . . . . . . . . 66118.4.1 Application of Lassoto Protein Mass Spectroscopy . . . . . . . . . . 66418.4.2 The Fused Lasso for Functional Data . . . . . . 66618.5 Classification When Features are Unavailable . . . . . . . 66818.5.1 Example: String Kernelsand Protein Classification . . . . . . . . . . . . . 66818.5.2 Classification and Other Models UsingInner-Product Kernels and Pairwise Distances . 67018.5.3 Example: Abstracts Classification . . . . . . . . 67218.6 High-Dimensional Regression:Supervised Principal Components . . . . . . . . . . . . . 67418.6.1 Connection to Latent-Variable Modeling . . . . 67818.6.2 Relationship with Partial Least Squares . . . . . 68018.6.3 Pre-Conditioning for Feature Selection . . . . . 68118.7 Feature Assessment and the Multiple-Testing Problem . . 68318.7.1 The False Discovery Rate . . . . . . . . . 
. . . . 687
18.7.2 Asymmetric Cutpoints and the SAM Procedure 690
18.7.3 A Bayesian Interpretation of the FDR . . . . . . 692
18.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . 693
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
References 699
Author Index 729
Index 737

1 Introduction

Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples of learning problems:

- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
- Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
- Identify the numbers in a handwritten ZIP code, from a digitized image.
- Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.
- Identify the risk factors for prostate cancer, based on clinical and demographic variables.

The science of learning plays a key role in the fields of statistics, data mining and artificial intelligence, intersecting with areas of engineering and other disciplines.

This book is about learning from data. In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome.

The examples above describe what is called the supervised learning problem. It is called "supervised" because of the presence of the outcome variable to guide the learning process. In the unsupervised learning problem, we observe only the features and have no measurements of the outcome. Our task is rather to describe how the data are organized or clustered. We devote most of this book to supervised learning; the unsupervised problem is less developed in the literature, and is the focus of Chapter 14.

Here are some examples of real learning problems that are discussed in this book.

Example 1: Email Spam

The data for this example consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or "spam." The objective was to design an automatic spam detector that could filter out spam before clogging the users' mailboxes. For all 4601 email messages, the true outcome (email type) email or spam is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message. This is a supervised learning problem, with the outcome the class variable email/spam.
It is also called a classification problem. Table 1.1 lists the words and characters showing the largest average difference between spam and email.

TABLE 1.1. Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

         george   you    your   hp     free   hpl    !      our    re     edu    remove
  spam    0.00    2.26   1.38   0.02   0.52   0.01   0.51   0.51   0.13   0.01   0.28
  email   1.27    1.27   0.44   0.90   0.07   0.43   0.11   0.18   0.42   0.29   0.01

Our learning method has to decide which features to use and how: for example, we might use a rule such as

    if (%george < 0.6) & (%you > 1.5) then spam else email.

Another form of a rule might be:

    if (0.2·%you - 0.3·%george) > 0 then spam else email.
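The two rules above translate directly into code. The following sketch is ours rather than the book's: it applies both rules to a dictionary of word and character percentages for a single message. The dictionary keys and function names are illustrative; only the thresholds and weights come from the rules quoted above.

```python
# Sketch (not from the book): the two hand-crafted spam rules quoted above.
# `features` maps each word or character to its percentage in a message;
# the keys and function names are illustrative.

def classify_by_threshold(features):
    """Rule 1: if (%george < 0.6) & (%you > 1.5) then spam else email."""
    if features["george"] < 0.6 and features["you"] > 1.5:
        return "spam"
    return "email"

def classify_by_linear_score(features):
    """Rule 2: if (0.2 * %you - 0.3 * %george) > 0 then spam else email."""
    score = 0.2 * features["you"] - 0.3 * features["george"]
    return "spam" if score > 0 else "email"

# A message that never mentions "george" but uses "you" often looks like spam.
message = {"george": 0.0, "you": 2.3}
print(classify_by_threshold(message))     # spam
print(classify_by_linear_score(message))  # spam
```

The second rule is already a linear classifier: it thresholds a weighted combination of the features, a theme developed in Chapters 3 and 4.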
For this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences. We discuss a number of different methods for tackling this learning problem in the book.

[FIGURE 1.1. Scatterplot matrix of the prostate cancer data. The first row shows the response against each of the predictors in turn. Two of the predictors, svi and gleason, are categorical. Variables shown: lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45. The plotted points are not recoverable from the extracted text.]

Example 2: Prostate Cancer

The data for this example, displayed in Figure 1.1, come from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures, in 97 men who were about to receive a radical prostatectomy. (There was an error in these data in the first edition of this book. Subject 32 had a value of 6.1 for lweight, which translates to a 449 gm prostate! The correct value is 44.9 gm. We are grateful to Prof. Stephen W. Link for alerting us to this error.)
The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume lcavol, log prostate weight lweight, age, log of benign prostatic hyperplasia amount lbph, seminal vesicle invasion svi, log of capsular penetration lcp, Gleason score gleason, and percent of Gleason scores 4 or 5 pgg45. Figure 1.1 is a scatterplot matrix of the variables. Some correlations with lpsa are evident, but a good predictive model is difficult to construct by eye. This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.

Example 3: Handwritten Digit Recognition

The data from this example come from the handwritten ZIP codes on envelopes from U.S. postal mail. Each image is a segment from a five digit ZIP code, isolating a single digit. The images are 16 x 16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255. Some sample images are shown in Figure 1.2.

[FIGURE 1.2. Examples of handwritten digits from U.S. postal envelopes.]

The images have been normalized to have approximately the same size and orientation. The task is to predict, from the 16 x 16 matrix of pixel intensities, the identity of each image (0, 1, ..., 9) quickly and accurately. If it is accurate enough, the resulting algorithm would be used as part of an automatic sorting procedure for envelopes. This is a classification problem for which the error rate needs to be kept very low to avoid misdirection of mail. In order to achieve this low error rate, some objects can be assigned to a "don't know" category, and sorted instead by hand.

Example 4: DNA Expression Microarrays

DNA stands for deoxyribonucleic acid, and is the basic material that makes up human chromosomes. DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA (messenger ribonucleic acid) present for that gene. Microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells.

Here is how a DNA microarray works. The nucleotide sequences for a few thousand genes are printed on a glass slide. A target sample and a reference sample are labeled with red and green dyes, and each are hybridized with the DNA on the slide. Through fluoroscopy, the log (red/green) intensities of RNA hybridizing at each site is measured. The result is a few thousand numbers, typically ranging from say -6 to 6, measuring the expression level of each gene in the target relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values.

A gene expression dataset collects together the expression values from a series of DNA microarray experiments, with each column representing an experiment. There are therefore several thousand rows representing individual genes, and tens of columns representing samples: in the particular example of Figure 1.3 there are 6830 genes (rows) and 64 samples (columns), although for clarity only a random sample of 100 rows are shown. The figure displays the data set as a heat map, ranging from green (negative) to red (positive).
The samples are 64 cancer tumors from different patients. The challenge here is to understand how the genes and samples are organized. Typical questions include the following:

(a) which samples are most similar to each other, in terms of their expression profiles across genes?
(b) which genes are most similar to each other, in terms of their expression profiles across samples?
(c) do certain genes show very high (or low) expression for certain cancer samples?

We could view this task as a regression problem, with two categorical predictor variables (genes and samples), the response variable being the level of expression. However, it is probably more useful to view it as an unsupervised learning problem. For example, for question (a) above, we think of the samples as points in 6830-dimensional space, which we want to cluster together in some way.

[FIGURE 1.3. DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order. Row labels are gene identifiers and column labels are tumor types (BREAST, RENAL, MELANOMA, COLON, LEUKEMIA, NSCLC, CNS, OVARIAN, PROSTATE, K562, MCF7, UNKNOWN); the heat map itself is not recoverable from the extracted text.]

Who Should Read this Book

This book is designed for researchers and students in a broad variety of fields: statistics, artificial intelligence, engineering, finance and others. We expect that the reader will have had at least one elementary course in statistics, covering basic topics including linear regression.

We have not attempted to write a comprehensive catalog of learning methods, but rather to describe some of the most important techniques. Equally notable, we describe the underlying concepts and considerations by which a researcher can judge a learning method. We have tried to write this book in an intuitive fashion, emphasizing concepts rather than mathematical details.

As statisticians, our exposition will naturally reflect our backgrounds and areas of expertise. However in the past eight years we have been attending conferences in neural networks, data mining and machine learning, and our thinking has been heavily influenced by these exciting fields.
This influenceis evident in our current research, and in this book.How This Book is OrganizedOur view is that one must understand simple methods before trying tograsp more complex ones. Hence, after giving an overview of the supervis-inglearning problem in Chapter 2, we discuss linear methods for regressionand classification in Chapters 3 and 4. In Chapter 5 we describe splines,wavelets and regularization/penalization methods for a single predictor,while Chapter 6 covers kernel methods and local regression. Both of thesesets of methods are important building blocks for high-dimensional learn-ingtechniques. Model assessment and selection is the topic of Chapter 7,covering the concepts of bias and variance, overfitting and methods such ascross-validation for choosing models. Chapter 8 discusses model inferenceand averaging, including an overview of maximum likelihood, Bayesian in-ferenceand the bootstrap, the EM algorithm, Gibbs sampling and bagging,A related procedure called boosting is the focus of Chapter 10.In Chapters 913 we describe a series of structured methods for su-pervisedlearning, with Chapters 9 and 11 covering regression and Chap-ters12 and 13 focusing on classification. Chapter 14 describes methods forunsupervised learning. Two recently proposed techniques, random forestsand ensemble learning, are discussed in Chapters 15 and 16. We describeundirected graphical models in Chapter 17 and finally we study high-dimensionalproblems in Chapter 18.At the end of each chapter we discuss computational considerations im-portantfor data mining applications, including how the computations scalewith the number of observations and predictors. Each chapter ends withBibliographic Notes giving background references for the material. 27. 8 1. IntroductionWe recommend that Chapters 14 be first read in sequence. Chapter 7should also be considered mandatory, as it covers central concepts thatpertain to all learning methods. With this in mind, the rest of the bookcan be read sequentially, or sampled, depending on the readers interest.The symbol indicates a technically difficult section, one that canbe skipped without interrupting the flow of the discussion.Book WebsiteThe website for this book is located athttp://www-stat.stanford.edu/ElemStatLearnIt contains a number of resources, including many of the datasets used inthis book.Note for InstructorsWe have successively used the first edition of this book as the basis for atwo-quarter course, and with the additional materials in this second edition,it could even be used for a three-quarter sequence. Exercises are provided atthe end of each chapter. It is important for students to have access to goodsoftware tools for these topics. We used the R and S-PLUS programminglanguages in our courses. 28. This is page 9Printer: Opaque this2Overview of Supervised Learning2.1 IntroductionThe first three examples described in Chapter 1 have several componentsin common. For each there is a set of variables that might be denoted asinputs, which are measured or preset. These have some influence on one ormore outputs. For each example the goal is to use the inputs to predict thevalues of the outputs. This exercise is called supervised learning.We have used the more modern language of machine learning. In thestatistical literature the inputs are often called the predictors, a term wewill use interchangeably with inputs, and more classically the independentvariables. In the pattern recognition literature the term features is preferred,which we use as well. 
The outputs are called the responses, or classicallythe dependent variables.2.2 Variable Types and TerminologyThe outputs vary in nature among the examples. In the glucose predictionexample, the output is a quantitative measurement, where some measure-mentsare bigger than others, and measurements close in value are closein nature. In the famous Iris discrimination example due to R. A. Fisher,the output is qualitative (species of Iris) and assumes values in a finite setG = {Virginica, Setosa and Versicolor}. In the handwritten digit examplethe output is one of 10 different digit classes: G = {0, 1, . . . , 9}. In both of 29. 10 2. Overview of Supervised Learningthese there is no explicit ordering in the classes, and in fact often descrip-tivelabels rather than numbers are used to denote the classes. Qualitativevariables are also referred to as categorical or discrete variables as well asfactors.For both types of outputs it makes sense to think of using the inputs topredict the output. Given some specific atmospheric measurements todayand yesterday, we want to predict the ozone level tomorrow. Given thegrayscale values for the pixels of the digitized image of the handwrittendigit, we want to predict its class label.This distinction in output type has led to a naming convention for theprediction tasks: regression when we predict quantitative outputs, and clas-sificationwhen we predict qualitative outputs. We will see that these twotasks have a lot in common, and in particular both can be viewed as a taskin function approximation.Inputs also vary in measurement type; we can have some of each of qual-itativeand quantitative input variables. These have also led to distinctionsin the types of methods that are used for prediction: some methods aredefined most naturally for quantitative inputs, some most naturally forqualitative and some for both.A third variable type is ordered categorical, such as small, medium andlarge, where there is an ordering between the values, but no metric notionis appropriate (the difference between medium and small need not be thesame as that between large and medium). These are discussed further inChapter 4.Qualitative variables are typically represented numerically by codes. Theeasiest case is when there are only two classes or categories, such as suc-cessor failure, survived or died. These are often represented by asingle binary digit or bit as 0 or 1, or else by 1 and 1. For reasons that willbecome apparent, such numeric codes are sometimes referred to as targets.When there are more than two categories, several alternatives are available.The most useful and commonly used coding is via dummy variables. Here aK-level qualitative variable is represented by a vector of K binary variablesor bits, only one of which is on at a time. Although more compact codingschemes are possible, dummy variables are symmetric in the levels of thefactor.We will typically denote an input variable by the symbol X. If X isa vector, its components can be accessed by subscripts Xj . Quantitativeoutputs will be denoted by Y , and qualitative outputs by G (for group).We use uppercase letters such as X, Y or G when referring to the genericaspects of a variable. Observed values are written in lowercase; hence theith observed value of X is written as xi (where xi is again a scalar orvector). Matrices are represented by bold uppercase letters; for example, aset of N input p-vectors xi, i = 1, . . . ,N would be represented by the Npmatrix X. 
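To make the coding conventions concrete, here is a small sketch (ours, not the book's; the levels and data are invented) that builds the K dummy variables for a K-level qualitative variable and stacks N input p-vectors into the N x p data matrix X.

```python
# Sketch (not from the book): dummy-variable coding of a K-level qualitative
# variable, and stacking N observed p-vectors x_i into the N x p matrix X.
import numpy as np

levels = ["small", "medium", "large"]             # K = 3 levels
observed = ["medium", "small", "large", "small"]  # N = 4 observations

# Each observation becomes a vector of K binary variables, only one of which is "on".
dummy = np.array([[1 if g == level else 0 for level in levels] for g in observed])
print(dummy)
# [[0 1 0]
#  [1 0 0]
#  [0 0 1]
#  [1 0 0]]

# Quantitative inputs: N = 4 observations, each a p-vector with p = 2,
# stacked row-wise into the N x p data matrix X.
X = np.array([[0.5, 1.2],
              [1.1, 0.3],
              [0.9, 2.4],
              [1.7, 0.8]])
print(X.shape)  # (4, 2)
```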
In general, vectors will not be bold, except when they have Ncomponents; this convention distinguishes a p-vector of inputs xi for the 30. 2.3 Least Squares and Nearest Neighbors 11Tiith observation from the N-vector xj consisting of all the observations onvariable Xj . Since all vectors are assumed to be column vectors, the ithrow of X is x, the vector transpose of xi.For the moment we can loosely state the learning task as follows: giventhe value of an input vector X, make a good prediction of the output Y,denoted by Y (pronounced y-hat). If Y takes values in IR then so should Y ; likewise for categorical outputs, Gshould take values in the same set G associated with G.For a two-class G, one approach is to denote the binary coded targetas Y , and then treat it as a quantitative output. The predictions Y willtypically lie in [0, 1], and we can assign to Gthe class label according towhether y0.5. This approach generalizes to K-level qualitative outputsas well.We need data to construct prediction rules, often a lot of it. We thussuppose we have available a set of measurements (xi, yi) or (xi, gi), i =1, . . . ,N, known as the training data, with which to construct our predictionrule.2.3 Two Simple Approaches to Prediction: LeastSquares and Nearest NeighborsIn this section we develop two simple but powerful prediction methods: thelinear model fit by least squares and the k-nearest-neighbor prediction rule.The linear model makes huge assumptions about structure and yields stablebut possibly inaccurate predictions. The method of k-nearest neighborsmakes very mild structural assumptions: its predictions are often accuratebut can be unstable.2.3.1 Linear Models and Least SquaresThe linear model has been a mainstay of statistics for the past 30 yearsand remains one of our most important tools. Given a vector of inputsXT = (X1,X2, . . . ,Xp), we predict the output Y via the model Y = 31. 0 +Xpj=1Xj 32. j . (2.1)The term 33. 0 is the intercept, also known as the bias in machine learning.Often it is convenient to include the constant variable 1 in X, include 34. 0 inthe vector of coefficients 35. , and then write the linear model in vector formas an inner product Y = XT 36. , (2.2) 37. 12 2. Overview of Supervised Learningwhere XT denotes vector or matrix transpose (X being a column vector).Here we are modeling a single output, so Y is a scalar; in general Y can bea Kvector, in which case 38. would be a pK matrix of coefficients. In the(p + 1)-dimensional inputoutput space, (X, Y ) represents a hyperplane.If the constant is included in X, then the hyperplane includes the originand is a subspace; if not, it is an affine set cutting the Y -axis at the point(0, 39. 0). From now on we assume that the intercept is included in 40. .Viewed as a function over the p-dimensional input space, f(X) = XT 41. is linear, and the gradient f0(X) = 42. is a vector in input space that pointsin the steepest uphill direction.How do we fit the linear model to a set of training data? There aremany different methods, but by far the most popular is the method ofleast squares. In this approach, we pick the coefficients 43. to minimize theresidual sum of squaresRSS( 44. ) =XNi=1(yi xTi 45. )2. (2.3)RSS( 46. ) is a quadratic function of the parameters, and hence its minimumalways exists, but may not be unique. The solution is easiest to characterizein matrix notation. We can writeRSS( 47. ) = (y X 48. )T (y X 49. 
    RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta),    (2.4)

where X is an N × p matrix with each row an input vector, and y is an N-vector of the outputs in the training set. Differentiating w.r.t. β we get the normal equations

    \mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0.    (2.5)

If X^T X is nonsingular, then the unique solution is given by

    \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},    (2.6)

and the fitted value at the ith input x_i is ŷ_i = ŷ(x_i) = x_i^T β̂. At an arbitrary input x_0 the prediction is ŷ(x_0) = x_0^T β̂. The entire fitted surface is characterized by the p parameters β̂. Intuitively, it seems that we do not need a very large data set to fit such a model.

Let's look at an example of the linear model in a classification context. Figure 2.1 shows a scatterplot of training data on a pair of inputs X_1 and X_2. The data are simulated, and for the present the simulation model is not important. The output class variable G has the values BLUE or ORANGE, and is represented as such in the scatterplot. There are 100 points in each of the two classes. The linear regression model was fit to these data, with the response Y coded as 0 for BLUE and 1 for ORANGE. The fitted values Ŷ are converted to a fitted class variable Ĝ according to the rule

    \hat{G} = \begin{cases} \text{ORANGE} & \text{if } \hat{Y} > 0.5, \\ \text{BLUE} & \text{if } \hat{Y} \le 0.5. \end{cases}    (2.7)
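To make (2.3)-(2.7) concrete, here is a minimal Python sketch under simplifying assumptions: the two-class training data are a stand-in (two Gaussian clouds, not the book's actual simulation, which is revealed later in this section), the response is coded 0/1, the coefficients come from the normal equations (2.5)-(2.6), and classes are assigned by the 0.5 threshold of (2.7). All names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in training data: two classes, coded y = 0 (BLUE) and y = 1 (ORANGE).
    N = 100
    X_blue = rng.normal(loc=(1.0, 0.0), scale=1.0, size=(N, 2))
    X_orange = rng.normal(loc=(0.0, 1.0), scale=1.0, size=(N, 2))
    X = np.vstack([X_blue, X_orange])
    y = np.concatenate([np.zeros(N), np.ones(N)])

    # Augment with the constant 1 so the intercept beta_0 is part of beta, as in (2.2).
    X1 = np.column_stack([np.ones(len(X)), X])

    # Normal equations (2.5): solve X^T X beta = X^T y, equivalently (2.6).
    beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

    # Fitted values and the classification rule (2.7): ORANGE if yhat > 0.5, else BLUE.
    y_hat = X1 @ beta_hat
    g_hat = np.where(y_hat > 0.5, "ORANGE", "BLUE")
    train_err = np.mean((y_hat > 0.5) != (y == 1))
    print(beta_hat, train_err)

The fitted decision boundary is the set of points where x^T β̂ = 0.5, which is a line in this two-dimensional example.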
FIGURE 2.1 (Linear Regression of 0/1 Response). A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by x^T β̂ = 0.5. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

The set of points in IR^2 classified as ORANGE corresponds to {x : x^T β̂ > 0.5}, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary {x : x^T β̂ = 0.5}, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density. In the case of one Gaussian per class, we will see in Chapter 4 that a linear decision boundary is the best one can do, and that our estimate is almost optimal. The region of overlap is inevitable, and future data to be predicted will be plagued by this overlap as well.

In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain.

We now look at another classification and regression procedure that is in some sense at the opposite end of the spectrum to the linear model, and far better suited to the second scenario.

2.3.2 Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set T closest in input space to x to form Ŷ.
Specifically, the k-nearest neighbor fit for Ŷ is defined as follows:

    \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,    (2.8)

where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with x_i closest to x in input space, and average their responses.

In Figure 2.2 we use the same training data as in Figure 2.1, and use 15-nearest-neighbor averaging of the binary coded response as the method of fitting. Thus Ŷ is the proportion of ORANGEs in the neighborhood, and so assigning class ORANGE to Ĝ if Ŷ > 0.5 amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates.

Figure 2.3 shows the results for 1-nearest-neighbor classification: Ŷ is assigned the value y_ℓ of the closest point x_ℓ to x in the training data. In this case the regions of classification can be computed relatively easily, and correspond to a Voronoi tessellation of the training data. Each point x_i has an associated tile bounding the region for which it is the closest input point. For all points x in the tile, Ĝ(x) = g_i. The decision boundary is even more irregular than before.

The method of k-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output Y, although k = 1 would be an unlikely choice.
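A brute-force Python sketch of the fit (2.8) follows. The training data are again a stand-in rather than the book's simulation, Euclidean distance is used as in the text, and thresholding the 0/1 average at 0.5 implements the majority vote; function and variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in 0/1-coded training data (BLUE = 0, ORANGE = 1).
    X = np.vstack([rng.normal((1.0, 0.0), 1.0, size=(100, 2)),
                   rng.normal((0.0, 1.0), 1.0, size=(100, 2))])
    y = np.concatenate([np.zeros(100), np.ones(100)])

    def knn_fit(x0, X, y, k=15):
        # k-nearest-neighbor average (2.8): mean of the responses y_i over the
        # k training points x_i closest to x0 in Euclidean distance.
        dists = np.linalg.norm(X - x0, axis=1)   # distances to every training point
        nbrs = np.argsort(dists)[:k]             # indices of the k closest points, N_k(x0)
        return y[nbrs].mean()

    def knn_classify(x0, X, y, k=15):
        # With the 0/1 coding, the average is the proportion of ORANGE among the neighbors,
        # so thresholding at 0.5 is a majority vote (as in Figure 2.2 for k = 15).
        return "ORANGE" if knn_fit(x0, X, y, k) > 0.5 else "BLUE"

    print(knn_classify(np.array([0.5, 0.5]), X, y, k=15))

Evaluating knn_classify on a fine grid of x values and coloring the grid by the predicted class reproduces the kind of irregular decision regions shown in Figure 2.2.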
FIGURE 2.2 (15-Nearest Neighbor Classifier). The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 15 nearest neighbors.

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1! It would seem that k-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.
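The point about training error can be checked numerically. The sketch below fits k-nearest neighbors for several values of k on stand-in data (not the mixture described next) and compares training error, which drops to zero at k = 1, with error on an independent test set; the generator, sample sizes, and values of k are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate(n_per_class):
        # Simple stand-in generator (two Gaussian clouds), coded BLUE = 0, ORANGE = 1.
        X = np.vstack([rng.normal((1.0, 0.0), 1.0, size=(n_per_class, 2)),
                       rng.normal((0.0, 1.0), 1.0, size=(n_per_class, 2))])
        y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
        return X, y

    def knn_predict(x0, Xtrain, ytrain, k):
        nbrs = np.argsort(np.linalg.norm(Xtrain - x0, axis=1))[:k]
        return float(ytrain[nbrs].mean() > 0.5)

    def error_rate(Xeval, yeval, Xtrain, ytrain, k):
        preds = np.array([knn_predict(x0, Xtrain, ytrain, k) for x0 in Xeval])
        return np.mean(preds != yeval)

    Xtr, ytr = simulate(100)
    Xte, yte = simulate(1000)   # independent test set

    for k in [1, 15, 51]:
        print(k,
              error_rate(Xtr, ytr, Xtr, ytr, k),   # training error: 0 when k = 1
              error_rate(Xte, yte, Xtr, ytr, k))   # test error: a fairer basis for comparison

Because each training point is its own nearest neighbor, the k = 1 rule reproduces the training labels exactly, which is why training error cannot be used to choose k.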
FIGURE 2.3 (1-Nearest Neighbor Classifier). The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable: high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated an observation from N(m_k, I/5), leading to a mixture of Gaussian clusters for each class.
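The generative recipe just described translates into a short simulation. The Python sketch below follows it directly (10 means per class, each observation drawn around a randomly chosen mean with covariance I/5); it is an illustration rather than the authors' original code, and the seed and helper names are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)

    def sample_class(center, n_means=10, n_obs=100):
        # Draw the 10 cluster means from N(center, I); for each observation pick one
        # of the means at random (probability 1/10 each) and draw from N(m_k, I/5).
        means = rng.multivariate_normal(center, np.eye(2), size=n_means)
        picks = rng.integers(n_means, size=n_obs)
        noise = rng.multivariate_normal(np.zeros(2), np.eye(2) / 5.0, size=n_obs)
        return means[picks] + noise

    X_blue = sample_class(np.array([1.0, 0.0]))     # class BLUE
    X_orange = sample_class(np.array([0.0, 1.0]))   # class ORANGE

    X = np.vstack([X_blue, X_orange])
    y = np.concatenate([np.zeros(100), np.ones(100)])   # 0/1 coding as before
    print(X.shape, y.mean())

Refitting the least squares rule (2.7) and the k-nearest-neighbor rule (2.8) to data drawn this way illustrates why the nearest-neighbor decision boundary tracks the clustered structure better than a single line can.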