
    Data Mining and Analysis:

    Fundamental Concepts and Algorithms

    Mohammed J. Zaki

    Wagner Meira Jr.

Contents

Preface

1 Data Mining and Analysis
   1.1 Data Matrix
   1.2 Attributes
   1.3 Data: Algebraic and Geometric View
      1.3.1 Distance and Angle
      1.3.2 Mean and Total Variance
      1.3.3 Orthogonal Projection
      1.3.4 Linear Independence and Dimensionality
   1.4 Data: Probabilistic View
      1.4.1 Bivariate Random Variables
      1.4.2 Multivariate Random Variable
      1.4.3 Random Sample and Statistics
   1.5 Data Mining
      1.5.1 Exploratory Data Analysis
      1.5.2 Frequent Pattern Mining
      1.5.3 Clustering
      1.5.4 Classification
   1.6 Further Reading
   1.7 Exercises

I Data Analysis Foundations

2 Numeric Attributes
   2.1 Univariate Analysis
      2.1.1 Measures of Central Tendency
      2.1.2 Measures of Dispersion
   2.2 Bivariate Analysis
      2.2.1 Measures of Location and Dispersion
      2.2.2 Measures of Association
   2.3 Multivariate Analysis
   2.4 Data Normalization
   2.5 Normal Distribution
      2.5.1 Univariate Normal Distribution
      2.5.2 Multivariate Normal Distribution
   2.6 Further Reading
   2.7 Exercises

3 Categorical Attributes
   3.1 Univariate Analysis
      3.1.1 Bernoulli Variable
      3.1.2 Multivariate Bernoulli Variable
   3.2 Bivariate Analysis
      3.2.1 Attribute Dependence: Contingency Analysis
   3.3 Multivariate Analysis
      3.3.1 Multi-way Contingency Analysis
   3.4 Distance and Angle
   3.5 Discretization
   3.6 Further Reading
   3.7 Exercises

4 Graph Data
   4.1 Graph Concepts
   4.2 Topological Attributes
   4.3 Centrality Analysis
      4.3.1 Basic Centralities
      4.3.2 Web Centralities
   4.4 Graph Models
      4.4.1 Erdős-Rényi Random Graph Model
      4.4.2 Watts-Strogatz Small-world Graph Model
      4.4.3 Barabási-Albert Scale-free Model
   4.5 Further Reading
   4.6 Exercises

5 Kernel Methods
   5.1 Kernel Matrix
      5.1.1 Reproducing Kernel Map
      5.1.2 Mercer Kernel Map
   5.2 Vector Kernels
   5.3 Basic Kernel Operations in Feature Space
   5.4 Kernels for Complex Objects
      5.4.1 Spectrum Kernel for Strings
      5.4.2 Diffusion Kernels on Graph Nodes
   5.5 Further Reading
   5.6 Exercises

6 High-Dimensional Data
   6.1 High-Dimensional Objects
   6.2 High-Dimensional Volumes
   6.3 Hypersphere Inscribed within Hypercube
   6.4 Volume of Thin Hypersphere Shell
   6.5 Diagonals in Hyperspace
   6.6 Density of the Multivariate Normal
   6.7 Appendix: Derivation of Hypersphere Volume
   6.8 Further Reading
   6.9 Exercises

7 Dimensionality Reduction
   7.1 Background
   7.2 Principal Component Analysis
      7.2.1 Best Line Approximation
      7.2.2 Best Two-dimensional Approximation
      7.2.3 Best r-dimensional Approximation
      7.2.4 Geometry of PCA
   7.3 Kernel Principal Component Analysis (Kernel PCA)
   7.4 Singular Value Decomposition
      7.4.1 Geometry of SVD
      7.4.2 Connection between SVD and PCA
   7.5 Further Reading
   7.6 Exercises

II Frequent Pattern Mining

8 Itemset Mining
   8.1 Frequent Itemsets and Association Rules
   8.2 Itemset Mining Algorithms
      8.2.1 Level-Wise Approach: Apriori Algorithm
      8.2.2 Tidset Intersection Approach: Eclat Algorithm
      8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
   8.3 Generating Association Rules
   8.4 Further Reading
   8.5 Exercises

9 Summarizing Itemsets
   9.1 Maximal and Closed Frequent Itemsets
   9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
   9.3 Mining Closed Frequent Itemsets: Charm Algorithm
   9.4 Non-Derivable Itemsets
   9.5 Further Reading
   9.6 Exercises

10 Sequence Mining
   10.1 Frequent Sequences
   10.2 Mining Frequent Sequences
      10.2.1 Level-Wise Mining: GSP
      10.2.2 Vertical Sequence Mining: SPADE
      10.2.3 Projection-Based Sequence Mining: PrefixSpan
   10.3 Substring Mining via Suffix Trees
      10.3.1 Suffix Tree
      10.3.2 Ukkonen's Linear Time Algorithm
   10.4 Further Reading
   10.5 Exercises

11 Graph Pattern Mining
   11.1 Isomorphism and Support
   11.2 Candidate Generation
      11.2.1 Canonical Code
   11.3 The gSpan Algorithm
      11.3.1 Extension and Support Computation
      11.3.2 Canonicality Checking
   11.4 Further Reading
   11.5 Exercises

12 Pattern and Rule Assessment
   12.1 Rule and Pattern Assessment Measures
      12.1.1 Rule Assessment Measures
      12.1.2 Pattern Assessment Measures
      12.1.3 Comparing Multiple Rules and Patterns
   12.2 Significance Testing and Confidence Intervals
      12.2.1 Fisher Exact Test for Productive Rules
      12.2.2 Permutation Test for Significance
      12.2.3 Bootstrap Sampling for Confidence Interval
   12.3 Further Reading
   12.4 Exercises

III Clustering

13 Representative-based Clustering
   13.1 K-means Algorithm
   13.2 Kernel K-means
   13.3 Expectation Maximization (EM) Clustering
      13.3.1 EM in One Dimension
      13.3.2 EM in d Dimensions
      13.3.3 Maximum Likelihood Estimation
      13.3.4 Expectation-Maximization Approach
   13.4 Further Reading
   13.5 Exercises

14 Hierarchical Clustering
   14.1 Preliminaries
   14.2 Agglomerative Hierarchical Clustering
      14.2.1 Distance between Clusters
      14.2.2 Updating Distance Matrix
      14.2.3 Computational Complexity
   14.3 Further Reading
   14.4 Exercises and Projects

15 Density-based Clustering
   15.1 The DBSCAN Algorithm
   15.2 Kernel Density Estimation
      15.2.1 Univariate Density Estimation
      15.2.2 Multivariate Density Estimation
      15.2.3 Nearest Neighbor Density Estimation
   15.3 Density-based Clustering: DENCLUE
   15.4 Further Reading
   15.5 Exercises

16 Spectral and Graph Clustering
   16.1 Graphs and Matrices
   16.2 Clustering as Graph Cuts
      16.2.1 Clustering Objective Functions: Ratio and Normalized Cut
      16.2.2 Spectral Clustering Algorithm
      16.2.3 Maximization Objectives: Average Cut and Modularity
   16.3 Markov Clustering
   16.4 Further Reading
   16.5 Exercises

17 Clustering Validation
   17.1 External Measures
      17.1.1 Matching Based Measures
      17.1.2 Entropy Based Measures
      17.1.3 Pair-wise Measures
      17.1.4 Correlation Measures
   17.2 Internal Measures
   17.3 Relative Measures
      17.3.1 Cluster Stability
      17.3.2 Clustering Tendency
   17.4 Further Reading
   17.5 Exercises

IV Classification

18 Probabilistic Classification
   18.1 Bayes Classifier
      18.1.1 Estimating the Prior Probability
      18.1.2 Estimating the Likelihood
   18.2 Naive Bayes Classifier
   18.3 Further Reading
   18.4 Exercises

19 Decision Tree Classifier
   19.1 Decision Trees
   19.2 Decision Tree Algorithm
      19.2.1 Split-point Evaluation Measures
      19.2.2 Evaluating Split-points
      19.2.3 Computational Complexity
   19.3 Further Reading
   19.4 Exercises

20 Linear Discriminant Analysis
   20.1 Optimal Linear Discriminant
   20.2 Kernel Discriminant Analysis
   20.3 Further Reading
   20.4 Exercises

21 Support Vector Machines
   21.1 Linear Discriminants and Margins
   21.2 SVM: Linear and Separable Case
   21.3 Soft Margin SVM: Linear and Non-Separable Case
      21.3.1 Hinge Loss
      21.3.2 Quadratic Loss
   21.4 Kernel SVM: Nonlinear Case
   21.5 SVM Training Algorithms
      21.5.1 Dual Solution: Stochastic Gradient Ascent
      21.5.2 Primal Solution: Newton Optimization

22 Classification Assessment
   22.1 Classification Performance Measures
      22.1.1 Contingency Table Based Measures
      22.1.2 Binary Classification: Positive and Negative Class
      22.1.3 ROC Analysis
   22.2 Classifier Evaluation
      22.2.1 K-fold Cross-Validation
      22.2.2 Bootstrap Resampling
      22.2.3 Confidence Intervals
      22.2.4 Comparing Classifiers: Paired t-Test
   22.3 Bias-Variance Decomposition
      22.3.1 Ensemble Classifiers
   22.4 Further Reading
   22.5 Exercises

Index


    Preface

This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. While there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text which focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics like kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines like machine learning and statistics, and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that the reader use their favorite data analysis and mining software to work through our examples, and to implement the algorithms we describe in the text; we recommend the R software, or the Python language with its NumPy package. The datasets used and other supplementary material like project ideas, slides, and so on, are available online at the book's companion site and its mirrors at RPI and UFMG:

http://dataminingbook.info
http://www.cs.rpi.edu/~zaki/dataminingbook
http://www.dcc.ufmg.br/dataminingbook

Having understood the basic principles and algorithms in data mining and data analysis, the readers will be well equipped to develop their own methods or use more advanced techniques.


    Suggested Roadmaps

The chapter dependency graph is shown in Figure 1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate level course, we suggest the following chapters: 1-3, 8, 10, 12-15, 17-19, and 21-22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8-15, 17-19, and 21-22. For a graduate course, one possibility is to quickly go over the material in Part I, or to assume it as background reading, and to directly cover Chapters 9-22; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order. For a course on data analysis the chapters must include 1-7, 13-14, 15 (Section 2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 1-3), 11-12, 13 (Sections 1-2), and 16-22.

Figure 1: Chapter Dependencies

    Acknowledgments

Initial drafts of this book have been used in many data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to

    Muhammad Abulaish, Jamia Millia Islamia, India


Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
Ayhan Demiriz, Sakarya University, Turkey
Murat Dundar, Indiana University Purdue University at Indianapolis
Jun Luke Huan, University of Kansas
Ruoming Jin, Kent State University
Latifur Khan, University of Texas, Dallas
Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
Suat Ozdemir, Gazi University, Turkey
Naren Ramakrishnan, Virginia Polytechnic and State University
Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
Saeed Salem, North Dakota State University
Ankur Teredesai, University of Washington, Tacoma
Hannu Toivonen, University of Helsinki, Finland
Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
Jason T.L. Wang, New Jersey Institute of Technology
Jianyong Wang, Tsinghua University, China
Jiong Yang, Case Western Reserve University
Jieping Ye, Arizona State University

We would like to thank all the students enrolled in our data mining courses at RPI and UFMG, and also the anonymous reviewers who provided technical comments on various chapters. In addition, we thank CNPq, CAPES, FAPEMIG, InWeb (the National Institute of Science and Technology for the Web), and Brazil's Science without Borders program for their support. We thank Lauren Cowles, our editor at Cambridge University Press, for her guidance and patience in realizing this book.

Finally, on a more personal front, MJZ would like to dedicate the book to Amina, Abrar, Afsah, and his parents, and WMJ would like to dedicate the book to Patricia, Gabriel, Marina, and his parents, Wagner and Marlene. This book would not have been possible without their patience and support.

Troy                    Mohammed J. Zaki
Belo Horizonte          Wagner Meira, Jr.
Summer 2013


    Chapter 1

    Data Mining and Analysis

Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering, and classification, laying out the roadmap for the book.

    1.1 Data Matrix

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The n × d data matrix is given as

$$\mathbf{D} = \begin{pmatrix} & X_1 & X_2 & \cdots & X_d \\ \mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

where $\mathbf{x}_i$ denotes the i-th row, which is a d-tuple given as

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$

and where $X_j$ denotes the j-th column, which is an n-tuple given as

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$

Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and


so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances n is referred to as the size of the data, whereas the number of attributes d is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis, and the simultaneous analysis of more than two attributes is called multivariate analysis.

          sepal     sepal     petal     petal     class
          length    width     length    width
          X1        X2        X3        X4        X5
  x1      5.9       3.0       4.2       1.5       Iris-versicolor
  x2      6.9       3.1       4.9       1.5       Iris-versicolor
  x3      6.6       2.9       4.6       1.3       Iris-versicolor
  x4      4.6       3.2       1.4       0.2       Iris-setosa
  x5      6.0       2.2       4.0       1.0       Iris-versicolor
  x6      4.7       3.2       1.3       0.2       Iris-setosa
  x7      6.5       3.0       5.8       2.2       Iris-virginica
  x8      5.8       2.7       5.1       1.9       Iris-virginica
  ...     ...       ...       ...       ...       ...
  x149    7.7       3.8       6.7       2.2       Iris-virginica
  x150    5.1       3.4       1.5       0.2       Iris-setosa

Table 1.1: Extract from the Iris Dataset

Example 1.1: Table 1.1 shows an extract of the Iris dataset; the complete data forms a 150 × 5 data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length, and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple

$$\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \text{Iris-versicolor})$$
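The data matrix view translates directly into code. Below is a minimal illustrative sketch (not from the book) that stores the eight rows of Table 1.1 in a NumPy array; the variable names D, y, x1, and X2 are our own choices.

```python
import numpy as np

# Numeric part of Table 1.1: sepal length, sepal width, petal length, petal width.
# The categorical class column is kept separately, since the float matrix holds
# only the numeric attributes.
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
    [4.7, 3.2, 1.3, 0.2],
    [6.5, 3.0, 5.8, 2.2],
    [5.8, 2.7, 5.1, 1.9],
])
y = np.array(["Iris-versicolor", "Iris-versicolor", "Iris-versicolor",
              "Iris-setosa", "Iris-versicolor", "Iris-setosa",
              "Iris-virginica", "Iris-virginica"])

n, d = D.shape     # n rows (entities), d columns (attributes)
x1 = D[0]          # first row x1 as a d-dimensional point
X2 = D[:, 1]       # second column X2 (sepal width) as an n-dimensional vector
print(n, d, x1, X2)
```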

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA, proteins), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix where rows represent images and columns correspond to image features like color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For


instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.

    1.2 Attributes

Attributes may be classified into two main types depending on their domain, i.e., depending on the types of values they take on.

Numeric Attributes  A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age, with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with domain(petal length) = R+ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set {0, 1}, it is called a binary attribute. Numeric attributes can be further classified into two types:

Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, the attribute temperature, measured in °C or °F, is interval-scaled. If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as the previous day.

Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for the attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.

Categorical Attributes  A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as

domain(Sex) = {M, F}
domain(Education) = {HighSchool, BS, MS, PhD}

Categorical attributes may be of two types:

Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example,


Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) = {iris-setosa, iris-versicolor, iris-virginica}.

Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another) and inequality comparisons (is one value less than or greater than another) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute, since its domain values are ordered by increasing educational qualification.

    1.3 Data: Algebraic and Geometric View

If the d attributes or dimensions in the data matrix D are all numeric, then each row can be considered as a d-dimensional point

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$

or equivalently, each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default)

$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d$$

where T is the matrix transpose operator.

The d-dimensional Cartesian coordinate space is specified via the d unit vectors, called the standard basis vectors, along each of the axes. The j-th standard basis vector $\mathbf{e}_j$ is the d-dimensional unit vector whose j-th component is 1 and the rest of the components are 0:

$$\mathbf{e}_j = (0, \ldots, 1_j, \ldots, 0)^T$$

Any other vector in $\mathbb{R}^d$ can be written as a linear combination of the standard basis vectors. For example, each of the points $\mathbf{x}_i$ can be written as the linear combination

$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\,\mathbf{e}_j$$

where the scalar value $x_{ij}$ is the coordinate value along the j-th axis or attribute.


Figure 1.1: Row $\mathbf{x}_1$ as a point and vector in (a) $\mathbb{R}^2$ and (b) $\mathbb{R}^3$

Figure 1.2: Scatter plot: sepal length versus sepal width. The solid circle shows the mean point.


Example 1.2: Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatter plot of all the n = 150 points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows x1 as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point (5.9, 3.0, 4.2) can be seen as specifying the coefficients in the linear combination of the standard basis vectors in $\mathbb{R}^3$:

$$\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3 = 5.9\begin{pmatrix}1\\0\\0\end{pmatrix} + 3.0\begin{pmatrix}0\\1\\0\end{pmatrix} + 4.2\begin{pmatrix}0\\0\\1\end{pmatrix} = \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}$$
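As a quick check of this linear-combination view, the following illustrative sketch (not from the book) rebuilds the point (5.9, 3.0, 4.2) from the standard basis vectors of R^3 with NumPy.

```python
import numpy as np

e1, e2, e3 = np.eye(3)                 # standard basis vectors of R^3
x1 = 5.9 * e1 + 3.0 * e2 + 4.2 * e3    # linear combination of the basis vectors
print(x1)                              # -> [5.9, 3.0, 4.2]
```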

Each numeric column or attribute can also be treated as a vector in an n-dimensional space $\mathbb{R}^n$:

$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$

If all attributes are numeric, then the data matrix D is in fact an n × d matrix, also written as $\mathbf{D} \in \mathbb{R}^{n \times d}$, given as

$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}$$

As we can see, we can consider the entire dataset as an n × d matrix, or equivalently as a set of n row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of d column vectors $X_j \in \mathbb{R}^n$.

    1.3.1 Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks.

Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two m-dimensional vectors given as

$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$


Dot Product  The dot product between a and b is defined as the scalar value

$$\mathbf{a}^T\mathbf{b} = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$

Length  The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as

$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$

The unit vector in the direction of a is given as

$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$

By definition u has length ‖u‖ = 1, and it is also called a normalized vector, which can be used in lieu of a in some analysis tasks.

The Euclidean norm is a special case of a general class of norms, known as the Lp-norm, defined as

$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m}|a_i|^p\right)^{\frac{1}{p}}$$

for any p ≠ 0. Thus, the Euclidean norm corresponds to the case when p = 2.

Distance  From the Euclidean norm we can define the Euclidean distance between a and b, as follows

$$\delta(\mathbf{a},\mathbf{b}) = \|\mathbf{a}-\mathbf{b}\| = \sqrt{(\mathbf{a}-\mathbf{b})^T(\mathbf{a}-\mathbf{b})} = \sqrt{\sum_{i=1}^{m}(a_i-b_i)^2} \tag{1.1}$$

Thus, the length of a vector is simply its distance from the zero vector 0, all of whose elements are 0, i.e., ‖a‖ = ‖a − 0‖ = δ(a, 0).

From the general Lp-norm we can define the corresponding Lp-distance function, given as follows

$$\delta_p(\mathbf{a},\mathbf{b}) = \|\mathbf{a}-\mathbf{b}\|_p \tag{1.2}$$


Angle  The cosine of the smallest angle between vectors a and b, also called the cosine similarity, is given as

$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \tag{1.3}$$

Thus, the cosine of the angle between a and b is given as the dot product of the unit vectors a/‖a‖ and b/‖b‖.

The Cauchy-Schwartz inequality states that for any vectors a and b in $\mathbb{R}^m$

$$|\mathbf{a}^T\mathbf{b}| \leq \|\mathbf{a}\|\,\|\mathbf{b}\|$$

It follows immediately from the Cauchy-Schwartz inequality that

$$-1 \leq \cos\theta \leq 1$$

Since the smallest angle θ ∈ [0°, 180°] and since cos θ ∈ [−1, 1], the cosine similarity value ranges from +1, corresponding to an angle of 0°, to −1, corresponding to an angle of 180° (or π radians).

Orthogonality  Two vectors a and b are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$, which in turn implies that cos θ = 0, that is, the angle between them is 90° or π/2 radians. In this case, we say that they have no similarity.

Figure 1.3: Distance and Angle. Unit vectors are shown in gray.

Example 1.3 (Distance and Angle): Figure 1.3 shows the two vectors

$$\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$


Using Eq. (1.1), the Euclidean distance between them is given as

$$\delta(\mathbf{a},\mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16+1} = \sqrt{17} = 4.12$$

The distance can also be computed as the magnitude of the vector

$$\mathbf{a}-\mathbf{b} = \begin{pmatrix}5\\3\end{pmatrix} - \begin{pmatrix}1\\4\end{pmatrix} = \begin{pmatrix}4\\-1\end{pmatrix}$$

since $\|\mathbf{a}-\mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.

The unit vector in the direction of a is given as

$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2+3^2}}\begin{pmatrix}5\\3\end{pmatrix} = \frac{1}{\sqrt{34}}\begin{pmatrix}5\\3\end{pmatrix} = \begin{pmatrix}0.86\\0.51\end{pmatrix}$$

The unit vector in the direction of b can be computed similarly

$$\mathbf{u}_b = \begin{pmatrix}0.24\\0.97\end{pmatrix}$$

These unit vectors are also shown in gray in Figure 1.3.

By Eq. (1.3) the cosine of the angle between a and b is given as

$$\cos\theta = \frac{\begin{pmatrix}5\\3\end{pmatrix}^T\begin{pmatrix}1\\4\end{pmatrix}}{\sqrt{5^2+3^2}\,\sqrt{1^2+4^2}} = \frac{17}{\sqrt{34}\,\sqrt{17}} = \frac{1}{\sqrt{2}}$$

We can get the angle by computing the inverse of the cosine

$$\theta = \cos^{-1}\!\left(1/\sqrt{2}\right) = 45°$$

Let us consider the Lp-norm for a with p = 3; we get

$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$

The distance between a and b using Eq. (1.2) for the Lp-norm with p = 3 is given as

$$\|\mathbf{a}-\mathbf{b}\|_3 = \left\|(4,-1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
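The quantities in Example 1.3 can be reproduced with a few NumPy calls; the sketch below is illustrative only and assumes the same vectors a = (5, 3) and b = (1, 4).

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                 # Euclidean distance, ~4.12
u_a = a / np.linalg.norm(a)                  # unit vector along a, ~[0.86, 0.51]
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))     # ~45 degrees
l3_norm_a = np.linalg.norm(a, ord=3)         # (5^3 + 3^3)^(1/3), ~5.34
l3_dist = np.linalg.norm(a - b, ord=3)       # (|4|^3 + |-1|^3)^(1/3), ~4.02
print(dist, u_a, cos_theta, theta, l3_norm_a, l3_dist)
```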


    1.3.2 Mean and Total Variance

Mean  The mean of the data matrix D is the vector obtained as the average of all the row vectors

$$\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$$

Total Variance  The total variance of the data matrix D is the average squared distance of each point from the mean

$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n}\delta(\mathbf{x}_i,\boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i-\boldsymbol{\mu}\|^2 \tag{1.4}$$

Simplifying Eq. (1.4) we obtain

$$\begin{aligned}
\text{var}(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) + n\,\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\boldsymbol{\mu} + n\,\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - \|\boldsymbol{\mu}\|^2
\end{aligned}$$

The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).

Centered Data Matrix  Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points

$$\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T = \begin{pmatrix}\mathbf{x}_1^T\\\mathbf{x}_2^T\\\vdots\\\mathbf{x}_n^T\end{pmatrix} - \begin{pmatrix}\boldsymbol{\mu}^T\\\boldsymbol{\mu}^T\\\vdots\\\boldsymbol{\mu}^T\end{pmatrix} = \begin{pmatrix}\mathbf{x}_1^T-\boldsymbol{\mu}^T\\\mathbf{x}_2^T-\boldsymbol{\mu}^T\\\vdots\\\mathbf{x}_n^T-\boldsymbol{\mu}^T\end{pmatrix} = \begin{pmatrix}\mathbf{z}_1^T\\\mathbf{z}_2^T\\\vdots\\\mathbf{z}_n^T\end{pmatrix} \tag{1.5}$$

where $\mathbf{z}_i = \mathbf{x}_i - \boldsymbol{\mu}$ represents the centered point corresponding to $\mathbf{x}_i$, and $\mathbf{1} \in \mathbb{R}^n$ is the n-dimensional vector all of whose elements have value 1. The mean of the centered data matrix Z is $\mathbf{0} \in \mathbb{R}^d$, since we have subtracted the mean $\boldsymbol{\mu}$ from all the points $\mathbf{x}_i$.
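Here is a small illustrative sketch (not from the book) of the mean, total variance, and centering operations in NumPy, applied to a tiny made-up numeric data matrix.

```python
import numpy as np

D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [4.6, 3.2],
              [6.0, 2.2]])            # a tiny 4 x 2 numeric data matrix

mu = D.mean(axis=0)                    # mean vector: average of the row vectors
total_var = np.mean(np.sum((D - mu) ** 2, axis=1))            # Eq. (1.4)
# Equivalent form: average squared magnitude minus squared magnitude of the mean
total_var_alt = np.mean(np.sum(D ** 2, axis=1)) - np.sum(mu ** 2)

Z = D - mu                             # centered data matrix, Eq. (1.5)
print(mu, total_var, np.allclose(total_var, total_var_alt), Z.mean(axis=0))
```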


Figure 1.4: Orthogonal Projection

    1.3.3 Orthogonal Projection

Often in data mining we need to project a point or vector onto another vector, for example to obtain a new point after a change of the basis vectors. Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a, illustrated in Figure 1.4, is given as

$$\mathbf{b} = \mathbf{b}_{\parallel} + \mathbf{b}_{\perp} = \mathbf{p} + \mathbf{r} \tag{1.6}$$

where $\mathbf{p} = \mathbf{b}_{\parallel}$ is parallel to a, and $\mathbf{r} = \mathbf{b}_{\perp}$ is perpendicular or orthogonal to a. The vector p is called the orthogonal projection or simply projection of b on the vector a. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to b on the line passing through a. Thus, the magnitude of the vector r = b − p gives the perpendicular distance between b and a, which is often interpreted as the residual or error vector between the points b and p.

We can derive an expression for p by noting that p = c a for some scalar c, since p is parallel to a. Thus, r = b − p = b − c a. Since p and r are orthogonal, we have

$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b}-c\mathbf{a}) = c\,\mathbf{a}^T\mathbf{b} - c^2\,\mathbf{a}^T\mathbf{a} = 0$$

which implies that

$$c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$$

Therefore, the projection of b on a is given as

$$\mathbf{p} = \mathbf{b}_{\parallel} = c\,\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \tag{1.7}$$
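The projection formula in Eq. (1.7) translates directly into code; the following is an illustrative sketch with made-up vectors, not the book's implementation.

```python
import numpy as np

def project(b, a):
    """Orthogonal projection p of b onto the line spanned by a, as in Eq. (1.7)."""
    c = (a @ b) / (a @ a)       # scalar coefficient c = a^T b / a^T a
    return c * a

a = np.array([4.0, 1.0])
b = np.array([2.0, 3.0])
p = project(b, a)               # component of b parallel to a
r = b - p                       # residual, orthogonal to a
print(p, r, np.isclose(a @ r, 0.0))    # a^T r = 0 confirms orthogonality
```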


Figure 1.5: Projecting the Centered Data onto the Line ℓ

Example 1.4: Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as

$$\text{mean}(\mathbf{D}) = \begin{pmatrix}5.843\\3.054\end{pmatrix}$$

which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is var(D) = 0.868 (centering does not change this value).

Figure 1.5 shows the projection of each point onto the line ℓ, which is the line that maximizes the separation between the class iris-setosa (squares) and the other two classes (circles and triangles). The line ℓ is given as the set of all points $(x_1, x_2)^T$ satisfying the constraint

$$\begin{pmatrix}x_1\\x_2\end{pmatrix} = c\begin{pmatrix}-2.15\\2.75\end{pmatrix}$$

for all scalars c ∈ R.

    1.3.4 Linear Independence and Dimensionality

Given the data matrix

$$\mathbf{D} = \begin{pmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n\end{pmatrix}^T = \begin{pmatrix}X_1 & X_2 & \cdots & X_d\end{pmatrix}$$


we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original d attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.

Given any set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ in an m-dimensional vector space $\mathbb{R}^m$, their linear combination is given as

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$$

where $c_i \in \mathbb{R}$ are scalar values. The set of all possible linear combinations of the k vectors is called the span, denoted as span(v1, ..., vk), which is itself a vector space, being a subspace of $\mathbb{R}^m$. If span(v1, ..., vk) = $\mathbb{R}^m$, then we say that v1, ..., vk is a spanning set for $\mathbb{R}^m$.

Row and Column Space  There are several interesting vector spaces associated with the data matrix D, two of which are the column space and row space of D. The column space of D, denoted col(D), is the set of all linear combinations of the d column vectors or attributes $X_j \in \mathbb{R}^n$, i.e.,

$$\text{col}(\mathbf{D}) = \text{span}(X_1, X_2, \ldots, X_d)$$

By definition col(D) is a subspace of $\mathbb{R}^n$. The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors or points $\mathbf{x}_i \in \mathbb{R}^d$, i.e.,

$$\text{row}(\mathbf{D}) = \text{span}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$$

By definition row(D) is a subspace of $\mathbb{R}^d$. Note also that the row space of D is the column space of $\mathbf{D}^T$:

$$\text{row}(\mathbf{D}) = \text{col}(\mathbf{D}^T)$$

Linear Independence  We say that the vectors v1, ..., vk are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the k vectors are linearly dependent if there are scalars c1, c2, ..., ck, at least one of which is not zero, such that

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$$

On the other hand, v1, ..., vk are linearly independent if and only if

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \;\text{ implies }\; c_1 = c_2 = \cdots = c_k = 0$$

Simply put, a set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.


Dimension and Rank  Let S be a subspace of $\mathbb{R}^m$. A basis for S is a set of vectors in S, say v1, ..., vk, that are linearly independent and span S, i.e., span(v1, ..., vk) = S. In fact, a basis is a minimal spanning set. If the vectors in the basis are pair-wise orthogonal, they are said to form an orthogonal basis for S. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for S. For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors

$$\mathbf{e}_1 = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} \quad \mathbf{e}_2 = \begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} \quad \cdots \quad \mathbf{e}_m = \begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix}$$

Any two bases for S must have the same number of vectors, and the number of vectors in a basis for S is called the dimension of S, denoted as dim(S). Since S is a subspace of $\mathbb{R}^m$, we must have dim(S) ≤ m.

It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix $\mathbf{D} \in \mathbb{R}^{n \times d}$, we have rank(D) ≤ min(n, d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. Thus, even though the data points are ostensibly in a d-dimensional attribute space (the extrinsic dimensionality), if rank(D) < d, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate $\mathbf{D} \in \mathbb{R}^{n \times d}$ with a derived data matrix $\mathbf{D}' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, i.e., k ≪ d. In this case k may reflect the "true" intrinsic dimensionality of the data.
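To see the extrinsic versus intrinsic dimensionality distinction numerically, the following illustrative sketch (not from the book) builds a matrix whose third column is a linear combination of the first two, so its rank, and hence its intrinsic dimensionality, is 2 even though d = 3.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))       # 10 points with 2 independent attributes
# Third column is a linear combination of the first two, so it adds no new dimension.
D = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + 2 * A[:, 1]])

print(D.shape)                     # extrinsic dimensionality: (10, 3)
print(np.linalg.matrix_rank(D))    # intrinsic dimensionality: 2
```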

Example 1.5: The line ℓ in Figure 1.5 is given as ℓ = span$\big((-2.15,\,2.75)^T\big)$, with dim(ℓ) = 1. After normalization, we obtain the orthonormal basis for ℓ as the unit vector

$$\frac{1}{\sqrt{12.19}}\begin{pmatrix}-2.15\\2.75\end{pmatrix} = \begin{pmatrix}-0.615\\0.788\end{pmatrix}$$

    1.4 Data: Probabilistic View

The probabilistic view of the data assumes that each numeric attribute X is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, X is a


function X: O → R, where O, the domain of X, is the set of all possible outcomes of the experiment, also called the sample space, and R, the range of X, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then X: O → O is simply the identity function: X(v) = v for all v ∈ O. The distinction between the outcomes and the value of the random variable is important, since we may want to treat the observed values differently depending on the context, as seen in Example 1.6.

A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas X is called a continuous random variable if it can take on any value in its range.

5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1

Table 1.2: Iris Dataset: sepal length (in centimeters)

Example 1.6: Consider the sepal length attribute (X1) for the Iris dataset in Table 1.1. All n = 150 values of this attribute are shown in Table 1.2, which lie in the range [4.3, 7.9] with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes O.

By default, we can consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, since the outcomes (sepal length values) are all numeric.

On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable A as follows

$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}$$


Probability Mass Function  If X is discrete, the probability mass function of X is defined as

$$f(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}$$

In other words, the function f gives the probability P(X = x) that the random variable X has the exact value x. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values. f must also obey the basic rules of probability. That is, f must be non-negative

$$f(x) \geq 0$$

and the sum of all probabilities should add to 1

$$\sum_{x} f(x) = 1$$

Example 1.7 (Bernoulli and Binomial Distribution): In Example 1.6, A was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have a sepal length of at least 7 cm. We can thus estimate the probability mass function of A as follows

$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$

and

$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$

In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1], which denotes the probability of a success, i.e., the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, 1 − p is the probability of a failure, i.e., of not picking an Iris with long sepal length.

Let us consider another discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. In this case, B takes on the discrete values [0, m], and its probability mass function is given by the Binomial distribution

$$f(k) = P(B = k) = \binom{m}{k} p^k (1-p)^{m-k}$$

The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long sepal length Irises out of the m trials. For each selection of k long sepal length Irises, the total probability of the k successes is $p^k$, and the total probability of the m − k


failures is $(1-p)^{m-k}$. For example, since p = 0.087 from above, the probability of observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as

$$f(2) = P(B = 2) = \binom{10}{2}(0.087)^2(0.913)^8 = 0.164$$

Figure 1.6 shows the full probability mass function for different values of k for m = 10. Since p is quite small, the probability of k successes in so few trials falls off rapidly as k increases, becoming practically zero for values of k ≥ 6.

Figure 1.6: Binomial Distribution: Probability Mass Function (m = 10, p = 0.087)
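The Bernoulli estimate and the binomial probabilities in Example 1.7 can be checked with the Python standard library alone; the sketch below is illustrative only.

```python
from math import comb

# Bernoulli parameter estimated from Table 1.2: 13 of 150 sepal lengths are >= 7 cm.
p = 13 / 150                                   # ~0.087

def binom_pmf(k, m, p):
    """P(B = k): probability of k successes in m Bernoulli trials."""
    return comb(m, k) * p**k * (1 - p)**(m - k)

print(round(binom_pmf(2, 10, p), 3))           # ~0.164, as in Example 1.7
print([round(binom_pmf(k, 10, p), 4) for k in range(11)])   # full PMF for m = 10
```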

Probability Density Function  If X is continuous, its range is the entire set of real numbers R. The probability of any specific value x is only one out of the infinitely many possible values in the range of X, which means that P(X = x) = 0 for all x ∈ R. However, this does not mean that the value x is impossible, since in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals [a, b] ⊂ R, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which


specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R:

$$P\big(X \in [a, b]\big) = \int_{a}^{b} f(x)\,dx$$

As before, the density function f must satisfy the basic laws of probability

$$f(x) \geq 0, \text{ for all } x \in \mathbb{R}$$

and

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

We can get an intuitive understanding of the density function f by considering the probability density over a small interval of width 2ε > 0, centered at x, namely [x − ε, x + ε]:

$$P\big(X \in [x-\epsilon, x+\epsilon]\big) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)$$

$$f(x) \simeq \frac{P\big(X \in [x-\epsilon, x+\epsilon]\big)}{2\epsilon} \tag{1.8}$$

f(x) thus gives the probability density at x, given as the ratio of the probability mass to the width of the interval, i.e., the probability mass per unit distance. Thus, it is important to note that P(X = x) ≠ f(x).

Even though the probability density function f(x) does not specify the probability P(X = x), it can be used to obtain the relative probability of one value x1 over another x2, since for a given ε > 0, by Eq. (1.8), we have

$$\frac{P(X \in [x_1-\epsilon, x_1+\epsilon])}{P(X \in [x_2-\epsilon, x_2+\epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \tag{1.9}$$

Thus, if f(x1) is larger than f(x2), then values of X close to x1 are more probable than values close to x2, and vice versa.

Example 1.8 (Normal Distribution): Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$


Figure 1.7: Normal Distribution: Probability Density Function (μ = 5.84, σ² = 0.681)

There are two parameters of the normal density distribution, namely μ, which represents the mean value, and σ², which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic bell shape plot of the normal distribution. The parameters, μ = 5.84 and σ² = 0.681, were estimated directly from the data for sepal length in Table 1.2.

Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}}\exp\{0\} = 0.483$, we emphasize that the probability of observing X = μ is zero, i.e., P(X = μ) = 0. Thus, P(X = x) is not given by f(x); rather, P(X = x) is given as the area under the curve for an infinitesimally small interval [x − ε, x + ε] centered at x, with ε > 0. Figure 1.7 illustrates this with the shaded region centered at μ = 5.84. From Eq. (1.8), we have

$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$

As ε → 0, we get P(X = μ) → 0. However, based on Eq. (1.9) we can claim that the probability of observing values close to the mean value μ = 5.84 is 2.69 times the probability of observing values close to x = 7, since

$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
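The density values quoted in Example 1.8 follow directly from the normal formula; here is an illustrative sketch using only the standard library, with μ = 5.84 and σ² = 0.681 as in the text.

```python
from math import exp, pi, sqrt

mu, var = 5.84, 0.681

def normal_pdf(x, mu, var):
    """Gaussian density f(x) with mean mu and variance var."""
    return (1.0 / sqrt(2 * pi * var)) * exp(-(x - mu) ** 2 / (2 * var))

f_mu = normal_pdf(mu, mu, var)        # ~0.483
f_7 = normal_pdf(7.0, mu, var)        # ~0.18
print(round(f_mu, 3), round(f_7, 3), round(f_mu / f_7, 2))   # ratio ~2.69
```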


Cumulative Distribution Function  For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F: R → [0, 1], which gives the probability of observing a value at most some given value x:

$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$


Figure 1.9: Cumulative Distribution Function for the Normal Distribution

Figure 1.9 shows the cumulative distribution function for the normal density function shown in Figure 1.7. As expected, for a continuous random variable, the CDF is also continuous, and non-decreasing. Since the normal distribution is symmetric about the mean, we have F(μ) = P(X ≤ μ) = 0.5.

    1.4.1 Bivariate Random Variables

Instead of considering each attribute as a random variable, we can also perform pair-wise analysis by considering a pair of attributes, X1 and X2, as a bivariate random variable

$$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

X: O → R² is a function that assigns to each outcome in the sample space a pair of real numbers, i.e., a 2-dimensional vector $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{R}^2$. As in the univariate case, if the outcomes are numeric, then the default is to assume X to be the identity function.

Joint Probability Mass Function  If X1 and X2 are both discrete random variables then X has a joint probability mass function given as follows

$$f(\mathbf{x}) = f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = P(\mathbf{X} = \mathbf{x})$$


f must satisfy the following two conditions

$$f(\mathbf{x}) = f(x_1, x_2) \geq 0 \quad \text{for all } -\infty < x_1, x_2 < \infty$$

$$\sum_{\mathbf{x}} f(\mathbf{x}) = \sum_{x_1}\sum_{x_2} f(x_1, x_2) = 1$$

[…]

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \leq x) \tag{2.1}$$

where $I(x_i \leq x)$ is a binary indicator variable that indicates whether the given condition is satisfied or not. Intuitively, to obtain the empirical CDF we compute, for each value x ∈ R, how many points in the sample are less than or equal to x. The empirical CDF puts a probability mass of 1/n at each point $x_i$. Note that we use the notation $\hat{F}$ to denote the fact that the empirical CDF is an estimate for the unknown population CDF F.

Inverse Cumulative Distribution Function  Define the inverse cumulative distribution function or quantile function for a random variable X as follows

F^{-1}(q) = min{x | F(x) ≥ q}  for q ∈ [0, 1]    (2.2)

That is, the inverse CDF gives the least value of X for which a q fraction of the values are lower or equal, and a 1 − q fraction of the values are higher. The empirical inverse cumulative distribution function F̂^{-1} can be obtained from (2.1).

Empirical Probability Mass Function (PMF)  The empirical probability mass function of X is given as

f̂(x) = P(X = x) = (1/n) ∑_{i=1}^{n} I(xi = x)    (2.3)

where

I(xi = x) = 1 if xi = x, and 0 if xi ≠ x

The empirical PMF also puts a probability mass of 1/n at each point xi.
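The three empirical functions above translate directly into code. The following sketch (plain Python; the helper names and the tiny toy sample are made up for illustration) implements F̂ from (2.1), F̂^{-1} from (2.2), and f̂ from (2.3):

```python
from collections import Counter

def ecdf(sample, x):
    """Empirical CDF: fraction of sample points less than or equal to x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def inverse_ecdf(sample, q):
    """Empirical quantile function: min{x | F_hat(x) >= q}, as in (2.2)."""
    for x in sorted(sample):
        if ecdf(sample, x) >= q:
            return x

def epmf(sample, x):
    """Empirical PMF: fraction of sample points exactly equal to x."""
    return Counter(sample)[x] / len(sample)

data = [5.9, 6.9, 5.8, 5.8, 5.1, 7.7]    # made-up toy sample
print(ecdf(data, 5.8))                   # 0.5
print(inverse_ecdf(data, 0.5))           # 5.8
print(epmf(data, 5.8))                   # 2/6 = 0.333...
```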

    2.1.1 Measures of Central Tendency

These measures give an indication about the concentration of the probability mass, the middle values, and so on.


    Mean

The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. It provides a one-number summary of the location or central tendency for the distribution of X.

The mean or expected value of a discrete random variable X is defined as

μ = E[X] = ∑_x x · f(x)    (2.4)

where f(x) is the probability mass function of X. The expected value of a continuous random variable X is defined as

μ = E[X] = ∫ x · f(x) dx

where f(x) is the probability density function of X.

Sample Mean  The sample mean is a statistic, i.e., a function μ̂: {x1, x2, ..., xn} → ℝ, defined as the average value of the xi's

μ̂ = (1/n) ∑_{i=1}^{n} xi    (2.5)

It serves as an estimator for the unknown mean value μ of X. It can be derived by plugging the empirical PMF f̂(x) into (2.4)

μ̂ = ∑_x x · f̂(x) = ∑_x x · ( (1/n) ∑_{i=1}^{n} I(xi = x) ) = (1/n) ∑_{i=1}^{n} xi

Sample Mean is Unbiased  An estimator θ̂ is called an unbiased estimator for parameter θ if E[θ̂] = θ for every possible value of θ. The sample mean μ̂ is an unbiased estimator for the population mean μ, since

E[μ̂] = E[ (1/n) ∑_{i=1}^{n} xi ] = (1/n) ∑_{i=1}^{n} E[xi] = (1/n) ∑_{i=1}^{n} μ = μ    (2.6)

where we use the fact that the random variables xi are IID according to X, which implies that they have the same mean μ as X, i.e., E[xi] = μ for all xi. We also used the fact that the expectation function E is a linear operator, i.e., for any two random variables X and Y, and real numbers a and b, we have E[aX + bY] = aE[X] + bE[Y].


Robustness  We say that a statistic is robust if it is not affected by extreme values (such as outliers) in the data. The sample mean is unfortunately not robust, since a single large value (an outlier) can skew the average. A more robust measure is the trimmed mean, obtained after discarding a small fraction of extreme values on one or both ends. Furthermore, the mean can be somewhat misleading in that it is typically not a value that occurs in the sample, and it may not even be a value that the random variable can actually assume (for a discrete random variable). For example, the number of cars per capita is an integer-valued random variable, but according to the US Bureau of Transportation Studies, the average number of passenger cars in the US was 0.45 in 2008 (137.1 million cars, with a population size of 304.4 million). Obviously, one cannot own 0.45 cars; it can be interpreted as saying that on average there are 45 cars per 100 people.
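The robustness point can be seen in a short sketch (NumPy assumed; the data values and the 10% trimming fraction are arbitrary choices): one injected outlier drags the mean far away, while the trimmed mean barely moves.

```python
import numpy as np

def trimmed_mean(x, frac=0.1):
    """Mean after discarding the lowest and highest `frac` fraction of values."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * frac)               # number of values dropped at each end
    return x[k:len(x) - k].mean()

x = np.array([4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8])
x_out = np.append(x, 50.0)               # inject a single large outlier

print(x.mean(), x_out.mean())            # 5.35 versus roughly 9.4: the mean is skewed
print(trimmed_mean(x), trimmed_mean(x_out))   # both trimmed means stay close to 5.35
```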

    Median

The median of a random variable is defined as the value m such that

P(X ≤ m) ≥ 1/2  and  P(X ≥ m) ≥ 1/2

In other words, the median m is the middle-most value; half of the values of X are less and half of the values of X are more than m. In terms of the (inverse) cumulative distribution function, the median is therefore the value m for which

F(m) = 0.5  or  m = F^{-1}(0.5)

The sample median can be obtained from the empirical CDF (2.1) or the empirical inverse CDF (2.2) by computing

F̂(m) = 0.5  or  m = F̂^{-1}(0.5)

A simpler approach to compute the sample median is to first sort all the values xi (i ∈ [1, n]) in increasing order. If n is odd, the median is the value at position (n + 1)/2. If n is even, the values at positions n/2 and n/2 + 1 are both medians.

Unlike the mean, the median is robust, since it is not affected very much by extreme values. Also, it is a value that occurs in the sample and a value the random variable can actually assume.

    Mode

The mode of a random variable X is the value at which the probability mass function or the probability density function attains its maximum value, depending on whether X is discrete or continuous, respectively.

The sample mode is a value for which the empirical probability function (2.3) attains its maximum, given as

mode(X) = argmax_x f̂(x)


The mode may not be a very useful measure of central tendency for a sample, since by chance an unrepresentative element may be the most frequent element. Furthermore, if all values in the sample are distinct, each of them will be the mode.

Figure 2.1: Sample Mean for sepal length (μ̂ = 5.843). Multiple occurrences of the same value are shown stacked; the horizontal axis shows X1 and the vertical axis shows the frequency.

Example 2.1 (Sample Mean, Median, and Mode): Consider the attribute sepal length (X1) in the Iris dataset, whose values are shown in Table 1.2. The sample mean is given as follows

μ̂ = (1/150)(5.9 + 6.9 + ··· + 7.7 + 5.1) = 876.5/150 = 5.843

Figure 2.1 shows all 150 values of sepal length, and the sample mean. Figure 2.2a shows the empirical CDF and Figure 2.2b shows the empirical inverse CDF for sepal length.

Since n = 150 is even, the sample median is the value at positions n/2 = 75 and n/2 + 1 = 76 in sorted order. For sepal length both these values are 5.8, thus the sample median is 5.8. From the inverse CDF in Figure 2.2b, we can see that

F̂(5.8) = 0.5  or  5.8 = F̂^{-1}(0.5)

The sample mode for sepal length is 5, which can be observed from the frequency of 5 in Figure 2.1. The empirical probability mass at x = 5 is

f̂(5) = 10/150 = 0.067
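A sketch of how Example 2.1 might be reproduced in code (NumPy and scikit-learn assumed; scikit-learn's bundled copy of the Iris data is assumed to match Table 1.2 up to rounding):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris

X1 = load_iris().data[:, 0]                  # sepal length, n = 150 values

mean = X1.mean()                             # sample mean, expected to be near 5.843
median = np.median(X1)                       # sample median, expected to be near 5.8
mode, count = Counter(X1).most_common(1)[0]  # most frequent value and its count

print(mean, median, mode, count / len(X1))   # last value: empirical mass at the mode
```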


Figure 2.2: Empirical CDF and Inverse CDF for sepal length. Panel (a) shows the empirical CDF F̂(x); panel (b) shows the empirical inverse CDF F̂^{-1}(q).

    2.1.2 Measures of Dispersion

The measures of dispersion give an indication about the spread or variation in the values of a random variable.

    Range

The value range or simply range of a random variable X is the difference between the maximum and minimum values of X, given as

r = max{X} − min{X}


The (value) range of X is a population parameter, not to be confused with the range of the function X, which is the set of all the values X can assume. Which range is being used should be clear from the context.

The sample range is a statistic, given as

r̂ = max_{i=1}^{n} {xi} − min_{i=1}^{n} {xi}

By definition, the range is sensitive to extreme values, and thus is not robust.

    Inter-Quartile Range

Quartiles are special values of the quantile function (2.2) that divide the data into 4 equal parts. That is, quartiles correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. The first quartile is the value q1 = F^{-1}(0.25), to the left of which 25% of the points lie; the second quartile is the same as the median value q2 = F^{-1}(0.5), to the left of which 50% of the points lie; the third quartile q3 = F^{-1}(0.75) is the value to the left of which 75% of the points lie; and the fourth quartile is the maximum value of X, to the left of which 100% of the points lie.

A more robust measure of the dispersion of X is the inter-quartile range (IQR), defined as

IQR = q3 − q1 = F^{-1}(0.75) − F^{-1}(0.25)    (2.7)

The IQR can also be thought of as a trimmed range, where we discard 25% of the low and high values of X. Put differently, it is the range for the middle 50% of the values of X. The IQR is robust by definition.

The sample IQR can be obtained by plugging the empirical inverse CDF into (2.7)

IQR̂ = q̂3 − q̂1 = F̂^{-1}(0.75) − F̂^{-1}(0.25)
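A small sketch of the sample quantiles under the definition F̂^{-1}(q) = min{x | F̂(x) ≥ q} (plain Python; the helper name and toy data are illustrative, and library routines such as NumPy's quantile may use different interpolation rules and so give slightly different values):

```python
import math

def sample_quantile(sample, q):
    """min{x | F_hat(x) >= q}: the ceil(q * n)-th smallest value in the sample."""
    xs = sorted(sample)
    k = max(1, math.ceil(q * len(xs)))   # rank of the required order statistic
    return xs[k - 1]

data = [4.3, 5.1, 5.8, 6.4, 7.9, 5.0, 6.0, 5.5]   # toy values
q1 = sample_quantile(data, 0.25)
q3 = sample_quantile(data, 0.75)
print(q1, q3, q3 - q1)                   # first quartile, third quartile, sample IQR
```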

    Variance and Standard Deviation

The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. More formally, variance is the expected value of the squared deviation from the mean, defined as

σ² = var(X) = E[(X − μ)²] = ∑_x (x − μ)² f(x)  if X is discrete, or ∫ (x − μ)² f(x) dx  if X is continuous    (2.8)

The standard deviation, σ, is defined as the positive square root of the variance, σ².


We can also write the variance as the difference between the expectation of X² and the square of the expectation of X

σ² = var(X) = E[(X − μ)²] = E[X² − 2μX + μ²]
   = E[X²] − 2μE[X] + μ² = E[X²] − 2μ² + μ²
   = E[X²] − (E[X])²    (2.9)

It is worth noting that variance is in fact the second moment about the mean, corresponding to r = 2, which is a special case of the r-th moment about the mean for a random variable X, defined as E[(X − μ)^r].

Sample Variance  The sample variance is defined as

σ̂² = (1/n) ∑_{i=1}^{n} (xi − μ̂)²    (2.10)

It is the average squared deviation of the data values xi from the sample mean μ̂, and can be derived by plugging the empirical probability function f̂ from (2.3) into (2.8), since

σ̂² = ∑_x (x − μ̂)² f̂(x) = ∑_x (x − μ̂)² ( (1/n) ∑_{i=1}^{n} I(xi = x) ) = (1/n) ∑_{i=1}^{n} (xi − μ̂)²

The sample standard deviation is given as the positive square root of the sample variance

σ̂ = √( (1/n) ∑_{i=1}^{n} (xi − μ̂)² )

The standard score, also called the z-score, of a sample value xi is the number of standard deviations the value is away from the mean

zi = (xi − μ̂) / σ̂

Put differently, the z-score of xi measures the deviation of xi from the mean value μ̂, in units of σ̂.

Geometric Interpretation of Sample Variance  We can treat the data sample for attribute X as a vector in n-dimensional space, where n is the sample size. That is, we write X = (x1, x2, ..., xn)^T ∈ ℝ^n. Further, let

Z = X − 1·μ̂ = (x1 − μ̂, x2 − μ̂, ..., xn − μ̂)^T


denote the mean-subtracted attribute vector, where 1 ∈ ℝ^n is the n-dimensional vector all of whose elements have value 1. We can rewrite (2.10) in terms of the magnitude of Z, i.e., the dot product of Z with itself

σ̂² = (1/n) ‖Z‖² = (1/n) Z^T Z = (1/n) ∑_{i=1}^{n} (xi − μ̂)²    (2.11)

The sample variance can thus be interpreted as the squared magnitude of the centered attribute vector, or the dot product of the centered attribute vector with itself, normalized by the sample size.

Example 2.2: Consider the data sample for sepal length shown in Figure 2.1. We can see that the sample range is given as

max_i{xi} − min_i{xi} = 7.9 − 4.3 = 3.6

From the inverse CDF for sepal length in Figure 2.2b, we can find the sample IQR as follows

q̂1 = F̂^{-1}(0.25) = 5.1
q̂3 = F̂^{-1}(0.75) = 6.4
IQR̂ = q̂3 − q̂1 = 6.4 − 5.1 = 1.3

The sample variance can be computed from the centered data vector via the expression (2.11)

σ̂² = (1/n) (X − 1·μ̂)^T (X − 1·μ̂) = 102.168/150 = 0.681

The sample standard deviation is then

σ̂ = √0.681 = 0.825
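Example 2.2 can be checked numerically with a sketch like the following (NumPy and scikit-learn assumed; note the 1/n normalization of (2.10) and (2.11), rather than the 1/(n − 1) variant some libraries default to):

```python
import numpy as np
from sklearn.datasets import load_iris

X1 = load_iris().data[:, 0]              # sepal length
n = len(X1)

rng = X1.max() - X1.min()                # sample range
Z = X1 - X1.mean()                       # centered attribute vector
var = Z @ Z / n                          # sample variance via (2.11), 1/n normalization
std = np.sqrt(var)                       # sample standard deviation
z_scores = Z / std                       # z-scores of all 150 values

print(rng, var, std, z_scores[:3])       # range, variance, std near 3.6, 0.681, 0.825
```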

Variance of the Sample Mean  Since the sample mean μ̂ is itself a statistic, we can compute its mean value and variance. The expected value of the sample mean is simply μ, as we saw in (2.6). To derive an expression for the variance of the sample mean, we utilize the fact that the random variables xi are all independent, and thus

var( ∑_{i=1}^{n} xi ) = ∑_{i=1}^{n} var(xi)


Further, since all the xi's are identically distributed as X, they have the same variance as X, i.e.,

var(xi) = σ²  for all i

Combining the above two facts, we get

var( ∑_{i=1}^{n} xi ) = ∑_{i=1}^{n} var(xi) = ∑_{i=1}^{n} σ² = nσ²    (2.12)

Further, note that

E[ ∑_{i=1}^{n} xi ] = nμ    (2.13)

Using (2.9), (2.12), and (2.13), the variance of the sample mean μ̂ can be computed as

var(μ̂) = E[(μ̂ − μ)²] = E[μ̂²] − μ² = E[ ( (1/n) ∑_{i=1}^{n} xi )² ] − (1/n²) ( E[ ∑_{i=1}^{n} xi ] )²
       = (1/n²) ( E[ ( ∑_{i=1}^{n} xi )² ] − ( E[ ∑_{i=1}^{n} xi ] )² ) = (1/n²) var( ∑_{i=1}^{n} xi ) = σ²/n    (2.14)

In other words, the sample mean μ̂ varies or deviates from the mean μ in proportion to the population variance σ². However, the deviation can be made smaller by considering a larger sample size n.
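The σ²/n behavior is easy to see in a small simulation sketch (NumPy assumed; the distribution, its parameters, and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                  # population variance of the simulated X

for n in (10, 100, 1000):
    # Draw many independent samples of size n and record each sample mean.
    means = rng.normal(0.0, np.sqrt(sigma2), size=(20000, n)).mean(axis=1)
    print(n, means.var(), sigma2 / n)         # empirical variance of mu_hat vs sigma^2 / n
```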

Sample Variance is Biased, but is Asymptotically Unbiased  The sample variance in (2.10) is a biased estimator for the true population variance, σ², i.e., E[σ̂²] ≠ σ². To show this we make use of the identity

∑_{i=1}^{n} (xi − μ)² = n(μ̂ − μ)² + ∑_{i=1}^{n} (xi − μ̂)²    (2.15)

Computing the expectation of σ̂² by using (2.15) in the first step, we get

E[σ̂²] = E[ (1/n) ∑_{i=1}^{n} (xi − μ̂)² ] = E[ (1/n) ∑_{i=1}^{n} (xi − μ)² ] − E[(μ̂ − μ)²]    (2.16)


Recall that the random variables xi are IID according to X, which means that they have the same mean μ and variance σ² as X. This means that

E[(xi − μ)²] = σ²

Further, from (2.14) the sample mean μ̂ has variance E[(μ̂ − μ)²] = σ²/n. Plugging these into (2.16), we get

E[σ̂²] = (1/n) · n · σ² − σ²/n = ((n − 1)/n) σ²

The sample variance σ̂² is a biased estimator of σ², since its expected value differs from the population variance by a factor of (n − 1)/n. However, it is asymptotically unbiased, that is, the bias vanishes as n → ∞, since

lim_{n→∞} (n − 1)/n = lim_{n→∞} (1 − 1/n) = 1

Put differently, as the sample size increases, we have

E[σ̂²] → σ²  as n → ∞
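The (n − 1)/n bias factor can also be checked by simulation (NumPy assumed; parameters arbitrary). The ddof argument of NumPy's var switches between the biased 1/n normalization used here and the unbiased 1/(n − 1) normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, trials = 1.0, 5, 200000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()    # averages to about ((n-1)/n) * sigma^2 = 0.8
unbiased = samples.var(axis=1, ddof=1).mean()  # averages to about sigma^2 = 1.0
print(biased, unbiased, (n - 1) / n * sigma2)
```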

    2.2 Bivariate Analysis

In bivariate analysis, we consider two attributes at the same time. We are specifically interested in understanding the association or dependence between them, if any. We thus restrict our attention to the two numeric attributes of interest, X1 and X2, with the data D represented as an n × 2 matrix, given as

D =
    X1    X2
    x11   x12
    x21   x22
    ...   ...
    xn1   xn2

Geometrically, we can think of D in two ways. It can be viewed as n points or vectors in two-dimensional space over the attributes X1 and X2, i.e., xi = (xi1, xi2)^T ∈ ℝ². Alternatively, it can be viewed as two points or vectors in an n-dimensional space comprising the points, i.e., each column is a vector in ℝ^n, as follows

X1 = (x11, x21, ..., xn1)^T
X2 = (x12, x22, ..., xn2)^T

In the probabilistic view, the column vector X = (X1, X2)^T is considered a bivariate vector random variable, and the points xi (1 ≤ i ≤ n) are treated as a random sample drawn from X, i.e., the xi's are considered independent and identically distributed as X.


Empirical Joint Probability Mass Function  The empirical joint probability mass function for X is given as

f̂(x) = P(X = x) = (1/n) ∑_{i=1}^{n} I(xi = x)    (2.17)

f̂(x1, x2) = P(X1 = x1, X2 = x2) = (1/n) ∑_{i=1}^{n} I(xi1 = x1, xi2 = x2)

where I is an indicator variable that takes on the value one only when its argument is true

I(xi = x) = 1 if xi1 = x1 and xi2 = x2, and 0 otherwise

As in the univariate case, the probability function puts a probability mass of 1/n at each point in the data sample.

    2.2.1 Measures of Location and Dispersion

Mean  The bivariate mean is defined as the expected value of the vector random variable X, as follows

μ = E[X] = E[(X1, X2)^T] = (E[X1], E[X2])^T = (μ1, μ2)^T    (2.18)

In other words, the bivariate mean vector is simply the vector of expected values along each attribute.

The sample mean vector can be obtained from f̂_X1 and f̂_X2, the empirical probability mass functions of X1 and X2, respectively, using (2.5). It can also be computed from the joint empirical PMF in (2.17)

μ̂ = ∑_x x · f̂(x) = ∑_x x · ( (1/n) ∑_{i=1}^{n} I(xi = x) ) = (1/n) ∑_{i=1}^{n} xi    (2.19)

Variance  We can compute the variance along each attribute, namely σ1² for X1 and σ2² for X2, using (2.8). The total variance (1.4) is given as

var(D) = σ1² + σ2²

The sample variances σ̂1² and σ̂2² can be estimated using (2.10), and the sample total variance is simply σ̂1² + σ̂2².


    2.2.2 Measures of Association

Covariance  The covariance between two attributes X1 and X2 provides a measure of the association or linear dependence between them, and is defined as

σ12 = E[(X1 − μ1)(X2 − μ2)]    (2.20)

By linearity of expectation, we have

σ12 = E[(X1 − μ1)(X2 − μ2)]
    = E[X1X2 − X1μ2 − X2μ1 + μ1μ2]
    = E[X1X2] − μ2E[X1] − μ1E[X2] + μ1μ2
    = E[X1X2] − μ1μ2
    = E[X1X2] − E[X1]E[X2]    (2.21)

The expression above can be seen as a generalization of the univariate variance (2.9) to the bivariate case.

If X1 and X2 are independent random variables, then we conclude that their covariance is zero. This is because if X1 and X2 are independent, then we have

E[X1X2] = E[X1] · E[X2]

which in turn implies that

σ12 = 0

However, the converse is not true. That is, if σ12 = 0, one cannot claim that X1 and X2 are independent. All we can say is that there is no linear dependence between them, but we cannot rule out that there might be a higher-order relationship or dependence between the two attributes, as the sketch below illustrates.
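A standard illustration of this point, sketched below with NumPy (the choice of a symmetric X1 and X2 = X1² is just one convenient example): the two variables are perfectly dependent, yet their sample covariance is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(0.0, 1.0, size=100_000)   # symmetric about zero
x2 = x1 ** 2                              # fully determined by x1, but not linearly

cov12 = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
print(cov12)                              # close to 0 despite the exact dependence
```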

The sample covariance between X1 and X2 is given as

σ̂12 = (1/n) ∑_{i=1}^{n} (xi1 − μ̂1)(xi2 − μ̂2)    (2.22)

It can be derived by substituting the empirical joint probability mass function f̂(x1, x2) from (2.17) into (2.20), as follows

σ̂12 = E[(X1 − μ̂1)(X2 − μ̂2)]
     = ∑_{x=(x1,x2)^T} (x1 − μ̂1)(x2 − μ̂2) f̂(x1, x2)
     = (1/n) ∑_{x=(x1,x2)^T} ∑_{i=1}^{n} (x1 − μ̂1)(x2 − μ̂2) I(xi1 = x1, xi2 = x2)
     = (1/n) ∑_{i=1}^{n} (xi1 − μ̂1)(xi2 − μ̂2)


Notice that sample covariance is a generalization of the sample variance (2.10), since

σ̂11 = (1/n) ∑_{i=1}^{n} (xi − μ̂1)(xi − μ̂1) = (1/n) ∑_{i=1}^{n} (xi − μ̂1)² = σ̂1²

and similarly, σ̂22 = σ̂2².

Correlation  The correlation between variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as

ρ12 = σ12 / (σ1 σ2) = σ12 / √(σ1² σ2²)    (2.23)

The sample correlation for attributes X1 and X2 is given as

ρ̂12 = σ̂12 / (σ̂1 σ̂2) = ∑_{i=1}^{n} (xi1 − μ̂1)(xi2 − μ̂2) / √( ∑_{i=1}^{n} (xi1 − μ̂1)² · ∑_{i=1}^{n} (xi2 − μ̂2)² )    (2.24)

Figure 2.3: Geometric Interpretation of Covariance and Correlation. The two centered attribute vectors Z1 and Z2, separated by the angle θ, are shown in the (conceptual) n-dimensional space ℝ^n spanned by the n points x1, x2, ..., xn.

Geometric Interpretation of Sample Covariance and Correlation  Let Z1 and Z2 denote the centered attribute vectors in ℝ^n, given as follows

Z1 = X1 − 1·μ̂1 = (x11 − μ̂1, x21 − μ̂1, ..., xn1 − μ̂1)^T
Z2 = X2 − 1·μ̂2 = (x12 − μ̂2, x22 − μ̂2, ..., xn2 − μ̂2)^T


The sample covariance (2.22) can then be written as

σ̂12 = (Z1^T Z2) / n

In other words, the covariance between the two attributes is simply the dot product between the two centered attribute vectors, normalized by the sample size. The above can be seen as a generalization of the univariate sample variance given in (2.11).

The sample correlation (2.24) can be written as

ρ̂12 = Z1^T Z2 / ( √(Z1^T Z1) · √(Z2^T Z2) ) = Z1^T Z2 / (‖Z1‖ ‖Z2‖) = (Z1/‖Z1‖)^T (Z2/‖Z2‖) = cos θ    (2.25)

Thus, the correlation coefficient is simply the cosine of the angle (1.3) between the two centered attribute vectors, as illustrated in Figure 2.3.
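The dot-product and cosine formulations translate directly into a short sketch (NumPy and scikit-learn assumed; the Iris attributes are used only because they are the running example):

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris().data
Z1 = data[:, 0] - data[:, 0].mean()       # centered sepal length
Z2 = data[:, 1] - data[:, 1].mean()       # centered sepal width
n = len(Z1)

cov12 = Z1 @ Z2 / n                                    # sample covariance as a dot product
corr12 = (Z1 @ Z2) / (np.linalg.norm(Z1) * np.linalg.norm(Z2))   # cosine of the angle
theta = np.degrees(np.arccos(corr12))                  # angle between the centered vectors

print(cov12, corr12, theta)
```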

Covariance Matrix  The variance-covariance information for the two attributes X1 and X2 can be summarized in the square 2 × 2 covariance matrix, given as

Σ = E[(X − μ)(X − μ)^T]
  = E[ (X1 − μ1, X2 − μ2)^T (X1 − μ1, X2 − μ2) ]
  = [ E[(X1 − μ1)(X1 − μ1)]   E[(X1 − μ1)(X2 − μ2)] ]
    [ E[(X2 − μ2)(X1 − μ1)]   E[(X2 − μ2)(X2 − μ2)] ]
  = [ σ1²   σ12 ]
    [ σ21   σ2² ]    (2.26)

Since σ12 = σ21, Σ is a symmetric matrix. The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.

The total variance of the two attributes is given as the sum of the diagonal elements of Σ, which is also called the trace of Σ, given as

var(D) = tr(Σ) = σ1² + σ2²

We immediately have tr(Σ) ≥ 0.

The generalized variance of the two attributes also considers the covariance, in addition to the attribute variances, and is given as the determinant det(Σ) of the covariance matrix Σ; it is also denoted as |Σ|. The generalized variance is non-negative, since

|Σ| = det(Σ) = σ1²σ2² − σ12² = σ1²σ2² − ρ12²σ1²σ2² = (1 − ρ12²) σ1²σ2²


where we used (2.23), i.e., σ12 = ρ12 σ1 σ2. Note that |ρ12| ≤ 1 implies that ρ12² ≤ 1, which in turn implies that det(Σ) ≥ 0, i.e., the determinant is non-negative.

The sample covariance matrix is given as

Σ̂ = [ σ̂1²   σ̂12 ]
    [ σ̂12   σ̂2² ]

The sample covariance matrix Σ̂ shares the same properties as Σ, i.e., it is symmetric and |Σ̂| ≥ 0, and it can be used to easily obtain the sample total and generalized variance.

Figure 2.4: Correlation between sepal length and sepal width. The horizontal axis shows X1 (sepal length) and the vertical axis shows X2 (sepal width).

Example 2.3 (Sample Mean and Covariance): Consider the sepal length and sepal width attributes for the Iris dataset, plotted in Figure 2.4. There are n = 150 points in the d = 2 dimensional attribute space. The sample mean vector is given as

μ̂ = (5.843, 3.054)^T

The sample covariance matrix is given as

Σ̂ = [  0.681   −0.039 ]
    [ −0.039    0.187 ]


The variance for sepal length is σ̂1² = 0.681, and that for sepal width is σ̂2² = 0.187. The covariance between the two attributes is σ̂12 = −0.039, and the correlation between them is

ρ̂12 = −0.039 / √(0.681 · 0.187) = −0.109

Thus, there is a very weak negative correlation between these two attributes, as evidenced by the best linear fit line in Figure 2.4. Alternatively, we can consider the attributes sepal length and sepal width as two vectors in ℝ^n. The correlation is then the cosine of the angle between them; we have

ρ̂12 = cos θ = −0.109, which implies that θ = cos^{-1}(−0.109) = 96.26°

The angle is close to 90°, i.e., the two attribute vectors are almost orthogonal, indicating weak correlation. Further, the angle being greater than 90° indicates negative correlation.

The sample total variance is given as

tr(Σ̂) = 0.681 + 0.187 = 0.868

and the sample generalized variance is given as

|Σ̂| = det(Σ̂) = 0.681 · 0.187 − (−0.039)² = 0.126
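A sketch that recomputes the quantities of Example 2.3 (NumPy and scikit-learn assumed; bias=True selects the 1/n normalization used in this chapter, and the last digits may differ from the text depending on rounding and the exact copy of the Iris data):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]                      # columns: sepal length, sepal width

mu_hat = X.mean(axis=0)                          # sample mean vector
Sigma_hat = np.cov(X, rowvar=False, bias=True)   # 2 x 2 sample covariance matrix (1/n)
rho = np.corrcoef(X, rowvar=False)[0, 1]         # sample correlation

total_var = np.trace(Sigma_hat)                  # sample total variance
gen_var = np.linalg.det(Sigma_hat)               # sample generalized variance

print(mu_hat, Sigma_hat, rho, total_var, gen_var)
```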

    2.3 Multivariate Analysis

In multivariate analysis, we consider all the d numeric attributes X1, X2, ..., Xd. The full data is an n × d matrix, given as

D =
    X1    X2    ...   Xd
    x11   x12   ...   x1d
    x21   x22   ...   x2d
    ...   ...   ...   ...
    xn1   xn2   ...   xnd

In the row view, the data can be considered as a set of n points or vectors in the d-dimensional attribute space

xi = (xi1, xi2, ..., xid)^T ∈ ℝ^d


In the column view, the data can be considered as a set of d points or vectors in the n-dimensional space spanned by the data points

Xj = (x1j, x2j, ..., xnj)^T ∈ ℝ^n

In the probabilistic view, the d attributes are modeled as a vector random variable, X = (X1, X2, ..., Xd)^T, and the points xi are considered to be a random sample drawn from X, i.e., they are independent and identically distributed as X.

Mean  Generalizing (2.18), the multivariate mean vector is obtained by taking the mean of each attribute, given as

μ = E[X] = (E[X1], E[X2], ..., E[Xd])^T = (μ1, μ2, ..., μd)^T

Generalizing (2.19), the sample mean is given as

μ̂ = (1/n) ∑_{i=1}^{n} xi

Covariance Matrix  Generalizing (2.26) to d dimensions, the multivariate covariance information is captured by the d × d (square) symmetric covariance matrix that gives the covariance for each pair of attributes

Σ = E[(X − μ)(X − μ)^T] =
    [ σ1²   σ12   ...   σ1d ]
    [ σ21   σ2²   ...   σ2d ]
    [ ...   ...   ...   ... ]
    [ σd1   σd2   ...   σd² ]

The diagonal element σi² specifies the attribute variance for Xi, whereas the off-diagonal elements σij = σji represent the covariance between attribute pairs Xi and Xj.

Covariance Matrix is Positive Semi-definite  It is worth noting that Σ is a positive semi-definite matrix, i.e.,

a^T Σ a ≥ 0  for any d-dimensional vector a

To see this, observe that

a^T Σ a = a^T E[(X − μ)(X − μ)^T] a = E[ a^T (X − μ)(X − μ)^T a ] = E[Y²] ≥ 0


where Y is the random variable Y = a^T(X − μ) = ∑_{i=1}^{d} ai (Xi − μi), and we use the fact that the expectation of a squared random variable is non-negative.

Since Σ is also symmetric, this implies that all the eigenvalues of Σ are real and non-negative. In other words, the d eigenvalues of Σ can be arranged from the largest to the smallest as follows: λ1 ≥ λ2 ≥ ... ≥ λd ≥ 0. A consequence is that the determinant of Σ is non-negative

det(Σ) = ∏_{i=1}^{d} λi ≥ 0    (2.27)
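These properties are easy to verify numerically. The following sketch (NumPy assumed; the random n × d data matrix merely stands in for any dataset) checks that a sample covariance matrix has non-negative eigenvalues whose product equals its determinant:

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(200, 4))                    # stand-in n x d data matrix

Sigma = np.cov(D, rowvar=False, bias=True)       # d x d sample covariance matrix
eigvals = np.linalg.eigvalsh(Sigma)              # real eigenvalues of the symmetric matrix

print(eigvals)                                   # all non-negative (up to floating-point error)
print(np.prod(eigvals), np.linalg.det(Sigma))    # product of eigenvalues equals the determinant
```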

Total and Generalized Variance  The total variance is given as the trace of the covariance matrix

var(D) = tr(Σ) = σ1² + σ2² + ... + σd²    (2.28)

Being a sum of squares, the total variance must be non-negative.