Lecture 1(i): Introduction & Roadmapcucuring/Lecture_1_Intro_StatML.pdf · 1.Introduction &...

Preview:

Citation preview

1

Lecture 1(i): Introduction & Roadmap

Foundations of Data Science:Algorithms and Mathematical Foundations

Mihai Cucuringumihai.cucuringu@stats.ox.ac.uk

CDT in Mathematics of Random SystemUniversity of Oxford

September 25, 2019

2

Course overview

I combines both theoretical and practical approachesI Goal 1: understand the mathematical, statistical and

algorithmic foundations behind some of the state-of-the-artalgorithms for tasks in machine learning/data miningI organization and visualization of data cloudsI dimensionality reduction •I clustering •I network analysis •I classificationI regressionI ranking •

I Goal 2: exposure to practical examples drawn from a widerange of topics including social network analysis, finance,statistics, etc

3

BooksTextbook• An Introduction to Statistical Learning by James, Witten,Hastie, and Tibshirani, freely available at

http://faculty.marshall.usc.edu/gareth-james/ISL/

Slides will also be made available.

• More advanced + good reference book: The Elements ofStatistical Learning by Hastie, Tibshirani, and Friedman.http://www-stat.stanford.edu/ElemStatLearn.

• Popular reference in the ML/data mining: Pattern Recognitionand Machine Learning Book, by Christopher Bishop.

• Ten Lectures and Forty-Two Open Problems in theMathematics of Data Science, by Afonso S. Bandeira

https://people.math.ethz.ch/˜abandeira/TenLecturesFortyTwoProblems.pdf

4

PrerequisitesProbability:I event, random variable, indicator variableI probability mass function, probability density function

cumulative distribution functionI joint distribution, marginal distributionI conditional probability, Bayes’s ruleI independenceI expectation, varianceI uniform, exponential, binomial, Poisson, Gaussian

distributionsStatistics:I Sampling from a population; mean, variance, standard

deviation, median, covariance, correlation, and theirsample versions; histogram;

I linear regression, response and predictor variables,coefficients, residuals.

5

Prerequisites

Linear algebra:I vector and matrix arithmeticI eigenvalues and eigenvectors of a matrix

Programming:I arithmetic (scalar, vector, and matrix operations)I writing functionsI reading in data sets from csv filesI using and manipulating data structures (subset a data

frame)I installing, loading, and using packagesI plotting

Wilson et al., Good enough practices in scientific computing, PLoScomputational biology, 2017 Jun 22;13(6)

6

A list of tentative topics1. Introduction & syllabus2. Statistical learning3. Measures of correlation and dependency4. Linear regression 1 (Simple regression)5. Linear regression 2 (Multiple regression)6. Singular Value Decomposition (SVD), rank-k

approximation, Principal Component Analysis (PCA)7. PCA in high dimensions and random matrix theory

(Marcenko-Pastur); applications to finance8. Basics of spectral graph theory

Nonlinear dimensionality reduction methods:9. Diffusion Maps and Vector Diffusion Maps

10. Multidimensional scaling and ISOMAP, Locally LinearEmbedding (LLE)

11. Kernel PCA

7

Topics

Clustering:12. Clustering: K-means, K-medoids, Hierarchical clustering13. Spectral clustering and Cheeger’s inequality14. Constrained clustering, clustering of signed graphs and

correlation clustering

Modern regression:15. Ridge regression16. LASSO

Estimation from pairwise measurements:17. The Page-Rank algorithm18. Ranking from pairwise incomplete noisy information19. Group synchronization

8

Additional Topics (if we had more time)

1. Classification: logistic regression, support vectormachines, linear discriminant analysis.

2. Tree-based methods for classification and regression

3. Bagging, Boosting, Random Forests

4. The Multiplicative Weights Update Method: a MetaAlgorithm and Applications• Arora, Sanjeev, Elad Hazan, Satyen Kale; Theory of Computing 2012

5. Johnson-Lindenstrauss Lemma; approximate nearestneighbors

9

Additional Topics (if we had more time)

Other potential topics7. electrical networks and random walks

8. graph sparsification and Laplacian linear system solvers

9. graph embeddings from noisy distances

10. low-rank matrix completion

11. non-negative matrix factorization

12. matrix algorithms using sampling; sketch of a large matrix

10

What is data science/data mining?Given (usually very large) data sets, how do you discoverstructural properties and make predictions?

Two very broad categories of problems:I Unsupervised learning: discover structure. E.g., given

measurements X1, . . . ,Xn, learn some underlying groupstructure based on the pattern of similarity between pairsof points (”working blind”)

I Supervised learning: make predictions. E.g., givenmeasurements (X1,Y1), ...(Xn,Yn), learn a model topredict Yi from Xi

I Semi-supervised learning: only for m (with m << n)observations we have both predictor + responsemeasurements.

11

This course

I Combines both applied and theoretical perspectives,though for some of the topics the emphasis will be on thealgorithms/methodology

I Aim to understand what is it that we are trying to doI Often, it’s not enough to load your data in

R/Matlab/Python, use any available packages and expectto get an answer that makes sense (reasonable)

I Understand your data! Data is often very messy, noisyand incomplete

I Be able to identify what the end goal is, and based on that(and the given data) identify what tools are available

12

Tradeoffs: exact versus approximation

I Many problems are computationally hard to solve exactlyI In order to come up with tractable algorithms (that run in

polynomial time) aim for an approximate solutionI Approximation algorithms can often perform well

(sometimes they even find the exact solution!) and scalewell computationally when applied to very large problems

I Polynomial time algorithms are often not enough, someapplications demand close to linear-time complexity(sometimes even sublinear!)

13

Bias-variance tradeoffIn supervised learning, when moving beyond the training set:I If the model is too simple, the solution is biased and does

not fit the dataI If the model is too complex, the solution is very sensitive to

small changes in the dataProblem of simultaneously minimizing two sources of error

I bias: difference btwn truth and what you expect to learnI high bias leads to missing out on the relevant relations

between features and target outputs (underfitting).I decreases with more complex models

I variance: difference between what you learn from aparticular dataset and what you expect to learn. Arisesfrom sensitivity to small fluctuations in the training set.I high variance leads to overfitting: modeling the random

noise in the training data, rather than the intended outputs.I variance decreases with simpler models

14

Interpretability versus forecasting power

I trade-off between a model that is interpretable and onethat predicts well under general circumstances

I essay on the distinction between explanatory andpredictive modeling:

To Explain or to Predict?, by Galit Shmueli, StatisticalScience 2010, Vol. 25, No. 3, 289–310https://www.stat.berkeley.edu/˜aldous/157/

Papers/shmueli.pdf

15

Occam’s razor

I a problem-solving principle known as the ’law of parsimony’I when faced with different competing hypotheses that

predict equally well, choose the one with the fewestassumptions

I usually, more complex models may provide betterpredictions, but in the absence of differences in predictivepower, the fewer assumptions the better

16

Statistics vs. Machine Learning

I Brian D. Ripley: ”machine learning is statistics minus anychecking of models and assumptions”

”Statistical Modeling: The Two Cultures”, Leo Breiman,Statistical Science, 16 (3), 2001; argued thatI statisticians rely too heavily on data modeling and

assumptionsI machine learning techniques are making progress by

instead relying on the predictive accuracy of models

17

Statistics vs. Machine Learning

More recently, statisticians focused more on finite-sampleproperties, algorithms and massive data sets (big data).Some, still ongoing, differences between the two communities:I Statistics papers are more formal and often comes with

proofs, while Machine Learning papers are more open tonew methodologies even if the theory is lacking (for now)

I The Machine Learning community primarily publishes inconferences and proceedings, while statisticians usejournal papers (much slower process)

I Some statisticians still focus on areas which are welloutside the scope of ML (survey design, sampling,industrial statistics, survival analysis, etc)

18

Final thoughts: No universal data mining recipe bookI hard to say which methods will work best in what situationsI sometimes, it’s crystal clear what method one should followI most often, we have little intuition a-priori on what

approach or set of tools should we use. Need toI understand the data first and the task at handI understand the methods and their assumptionsI make an educated guess on how to proceed

I often, customized tools are required to handle problemparticularities (eg., response variable is highly imbalanced,as in default rate prediction)

I sometimes (if enough resources are available) one oftentries many different methods, and chooses the one whichgives best (out-of-sample) resultsI most competitions are won by Ensemble methods -

techniques that create multiple models and then combinethem to produce improved results

I stacking: considers heterogeneous weak learners, learnsthem in parallel and combines by training a meta-model

19

Alzheimer’s Markers Predict Start of Mental Decline

20

Classification of handwritten digits

Figure: Automatic detection of handwritten postal codes.

21

Facebook: friend suggestions

Figure: People you may know.

22

Forecasting the stock market

Figure: Price of Google stock.

Source: http://businessforecastblog.com/forecasting-googles-stock-price-goog-on-20-trading-day-horizons/

23

Netflix: the $ 1,000,000 Prize

Figure: Movies you might enjoy.

24

eHarmony: predict whom you might fall in love with

Figure: Love is in the air (statistically speaking)

25

Identifying patterns in migration networks

Figure: Eigenvector colourings for the similarity matrix Wij =M2

ijPi Pj

.

26

Financial time series clustering• Clustering of the empirical correlation matrix of 1500 timeseries (stocks contained in the S&P 1500 index)• Compute the bottom k = 10 eigenvectors of L, and run astandard machine learning clustering algorithm (k-means++) torecover k clusters.

Figure: Left: the adjacency matrix A with rows/columns sorted inaccordance to cluster membership. Right: Sector decomposition ofthe recovered clusters (based on a standard classification of the USeconomy into sectors). See link for details: GICS link.M. Cucuringu, P. Davies, A. Glielmo, H. Tyagi, SPONGE: A generalized eigenproblemfor clustering signed networks, AISTATS 2019 (python code available)

27

Yogi Berra:

It’s tough to make predictions, especially about the future

28

Lecture 1(ii): Statistical Learning

Chapter 2 in ”An Introduction to Statistical Learning” by James, Witten, Hastie,and Tibshirani, freely available at

http://faculty.marshall.usc.edu/gareth-james/ISL/

29

Advertising data set

I sales of a product in 200 different marketsI + budgets for the product in each of those markets for

three different media: TV, radio, and newspaperI Goal: predict sales given the three media budgetsI input variables (denoted by X1,X2, . . .)

I X1 TV budgetI X2 radio budgetI X3 newspaper budget

I inputs known as such as predictors, independent variables,features, variables

I the output variable (sales) is the response or dependentvariable (denoted by Y )

30

Advertising data set

31

I observe a quantitative response Y and p differentpredictors, X = (X1,X2, . . . ,Xp)

I assume there is some relationship between Y andX = (X1,X2, . . . ,Xp)

Y = f (X ) + ε

I f is fixed but unknown functionI ε is a random error term, independent of X , with mean zero

32

Why Estimate f?

Task I: PredictionI Y = f (X )

I f is our estimate for fI Y is the resulting predictionI it’s OK if f is a black box algorithmI we only care about forecasting Y

33

Reducible error vs. irreducible errorReducible error:I f is an imperfect estimate for f ; leads to reducible errorI we can potentially improve the accuracy of f by using more

complex models to estimate fIrreducible error:I even if we perfectly estimate f and

Y = f (X )

our prediction is not perfectI recall Y is also a function of ε, and ε ⊥ XI variability associated with ε also affects the accuracy of our

predictionsI most often is positive (ε may contain unmeasured variables

that impact Y ; unmeasurable variation)

34

Prediction

Fix a given estimate f and a set of predictors XOur prediction Y = f (X )Consider the average/expected value of the squared error:

E[Y − Y ]2 = E[f (X ) + ε− f (X )]2 = E[(f (X )− f (X )) + ε]2 (1)

= E[(f (X )− f (X ))2 + 2(f (X )− f (X ))ε+ ε2

](2)

= E[(f (X )− f (X ))2] + 0 + E[ε2] (3)

= E[f (X )− f (X )]2 + Var[ε] (4)

= [f (X )− f (X )]2︸ ︷︷ ︸Reducible

+ Var[ε]︸ ︷︷ ︸Irreducible

(5)

Our goal: develop techniques for estimating f and minimize thereducible error.

35

Task II: Inference

Want to understand how Y changes as a function of X1, . . . ,Xp

Cannot treat f as a black box; need to know its exact form.I Which predictors are associated with the response?

I most often, only a small fraction of the predictors arerelevant for Y

I What is the relationship between the response and eachpredictor?I postive/negative relationship; depending on third predictors

I Can the relationship between Y and each predictor beadequately summarized using a linear equation, or is thisnot enough?I most data out there is non-linear; need non-linear methods

36

Inference for the Advertising data set

I Which media contribute to sales?I Which media generate the biggest boost in sales? orI How much increase in sales is associated with a given

increase in TV advertising?

37

Prediction + inference

Real estate setting: relate values of homes toI crime rate, zoning, distance from a river, air quality,

schools, income level of community, size of housesInference:I How individual input variables affect the prices?I How much extra will a house be worth if it has a view of the

river?Prediction:I Predict the value of a home given its characteristics

38

Linear vs Nonlinear

I linear methods: allow for relatively simple and interpretableinference, but worse prediction

I non-linear methods: very accurate predictions for Y , butless interpretable model thus making inference morechallenging

39

Estimating f

I Parametric methods: Linear Regression, Model Selection,Generalized Linear Models, Mixture Models, Classification(linear, logistic, support vector machines), GraphicalModels, Structured Prediction, Hidden Markov Models

I Nonparametric methods: Nonparametric Regression andDensity Estimation, Nonparametric Classification,Boosting, Clustering and Dimension Reduction, PCA,Manifold Methods, Principal Curves, Spectral Methods,The Bootstrap and Subsampling, Nonparametric Bayes.

40

Parametric Methods1. make an assumption about the the function form or shape

of f , e.g., in a linear model, f can be linear

f (X ) = β0 + β1X1 + β2X2 + . . .+ βpXp

2. fit or train the model; only need to estimate the p + 1coefficients βi s.t.

Y ≈ β0 + β1X1 + β2X2 + . . .+ βpXp

I use (ordinary) least squares or generalized linear modelsI reduced the problem of finding f to finding p + 1 coefficients

Downside: the chosen linear model might not match the(unknown) ground truth form of f .If we aim for more flexible and complex models:I need to estimate a larger number of parametersI may lead to overfitting (fit the noise too closely)

41

Income data set

Figure: The blue surface represents the true underlying relationshipbetween income and years of education and seniority, which is knownsince the data are simulated. The red dots indicate the observedvalues of these quantities for 30 individuals.

42

Linear model

income ≈ β0 + β1 × education + β2 × seniority

43

Linear model for the Income data

income ≈ β0 + β1 × education + β2 × seniority

I True f has some curvature not captured by the linear fitI OK for capturing the positive relationship between years of

education/seniority and income

44

Non-parametric Methods

I no explicit assumptions about the functional form of f(assumption-free approach)

I search for an estimate of f that best fits the data, withsome smoothness assumptions

I +: could fit a more wide range of functions fI -: a much larger number of observations is needed

(compared to parametric approaches)

Example: splines with varying levels of smoothness (allowingfor a rougher/tighter fit)

45

Smooth splines

46

Rough splines

Figure: Perfect fit on the given data!

47

Rough splines (cont.)

Figure: ... but far from the ground truth f (right plot)!

This is a good example of overfitting: the fit does not produceaccurate estimates of the response on new observations thatwere not part of the original training data set.

48

The Trade-Off Between Prediction Accuracy andModel Interpretability

I linear regression: fairly inflexibleI splines: considerably more flexible (can fit a much wider

range of possible shapes to estimate f )Inference:I then restrictive models are much more interpretable.I linear model: easy to understand the relationship between

Y and X1,X2, . . . ,Xp

Very flexible approaches (splines, boosting, etc)I can lead to such complicated estimates of fI hard to understand how any individual predictor is

associated with the responseLASSO:I less flexibleI linear model + sparsity of [β0, β1, . . . , βp]I more interpretable; only a small subset of predictors matter

49

Flexibility vs. InterpretabilityGeneralized additive models (GAMs)I extend the linear model to allow for certain non-linear

relationships

y = β0 + f1(xi,1) + f2(xi,2) + . . .+ fp(xi,p) + εi

I more flexible than linear regressionI less interpretable than linear regression (the relationship

predictor vs. response is modeled by a curve)I main limitation: the model is restricted to be additive (with

many variables, important interactions can be missed)

Fully non-linear methods: bagging, boosting, and supportvector machines with non-linear kernelsI highly flexibleI harder to interpret

50

Flexibility vs. Interpretability

51

For inferenceI better off using simple and relatively inflexible statistical

learning methods.For predictionI interpretability is not an issueI shall we use the most flexible model available?I surprisingly, this is not always the case!I sometimes we get more accurate (out-of-sample)

predictions using a less flexible methodI overfitting in highly flexible methods is a real

concern/danger

52

Supervised Versus Unsupervised LearningI supervised: there exists a response YI unsupervised: there is NO response Y (e.g., clustering)

I semi-supervised learning: some of the observations havean associated response Y , but some do not.

Personal note:I unsupervised learning: a means to an end!I in many/most applications, the ultimate goal is predictionI augment the unsupervised learning results (clustering,

non-linear dimensionality reduction) with a prediction step(even a simple linear regression)

53

Performance evaluation

I mean squared error (MSE)

MSE =1n

n∑i=1

(yi − f (xi))2

I f (xi) is the prediction of f observation iI training MSEI test dataI we care most/only about out-of-sample performanceI test MSE

Average(y0 − f (x0))2

on the test observations (x0, y0) we have not seen yet

54

Flexibility vs. MSE (non-linear)

Horizontal dashed line indicates Var(ε), the irreducible error.

55

Flexibility vs. MSE (linear)

56

Remarks

I test MSE >> train MSEI U-shape in the test MSEI as model flexibility increases, training MSE will decrease,

but the test MSE may notI overfitting: training MSE <<< test MSEI overfitting: when fitting the noise rather than properties of fI overfitting: when a less flexible model yields a smaller test

MSEI cross-validation: estimating test MSE using the training

data

57

R data sets (texbook)

I The UC Irvine Machine Learning Data Set Repository:http://archive.ics.uci.edu/ml/

I S&P 500/1500: daily open and close prices for 2003-2019I Networks data sets: directed, signed, temporal, multilayer

Recommended