Applied Statistics and Machine Learning
Prediction, Least Squares, Linear Regression
Bin Yu, IMA, June 18, 2013



Page 1

Applied Statistics and Machine Learning

Prediction, Least Squares, Linear Regression

Bin Yu, IMA, June 18, 2013


Page 2

Much of science, social science, and engineering is about finding relationships between predictor variables and response variables. If the response variables are observable, the problem is called regression (when the response is continuous) or classification (when the response is discrete).

Both are called supervised learning.

Prediction

Page 3

Examples include: using the expert labels in the cloud problem to find a rule to predict the expert label from MISR image data; using the answers to the dialect questions to guess the origin of the person answering the questionnaire.

Even for basic science research, prediction is a first-order goal, since if a model can't predict well, it is hard to believe it can explain the underlying mechanisms.

However, for finite data, it is possible that a wrong statistical model with estimated parameters gives better prediction than a correct model with estimated parameters.

Prediction

Page 4

Much of recent machine learning research has been on prediction in batch mode. That is, based on a set of training data, a prediction scheme is developed, and this scheme is then repeatedly applied to future data.

Prediction and evaluation

Page 5

Batch-mode statistical prediction problem: given training data that are units of predictor variables and a response variable, (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p is a vector of p real values and y_i is a scalar, one finds a prediction rule f mapping a predictor vector to a response, so that for a new predictor x we can predict the response variable by f(x).

When the response y_i is a vector, it is called a multi-response prediction problem.

Prediction

Page 6

Fundamental assumption of feasibility of prediction:

The available situation is similar, with respect to the aspects to be predicted, to the new situation for which prediction is needed.

Describe situations where predictions are needed and the approaches taken:

Prediction

Page 7

Fundamental assumption of feasibility of good data-driven prediction:

Data to be predicted for are similar, with respect to the relevant aspects, to the training data.

This implies that prediction rules learned from training data (available situations) are applicable to new predictor variables (new situations).

Prediction with data

Page 8

Describe a good prediction rule: compare the predicted value with the observed value. Sometimes one prediction point decides. Other times, we want to see an average performance.

How to evaluate a prediction rule?

Page 9

Things to consider:
• A measure of prediction error: a loss function between a prediction of a response and a realization of a response that matches the situation at hand. L2 is a convenient choice, but not always the appropriate one.
• An estimated prediction error (in an average sense).
• Uncertainty in the estimated prediction error.
• Computational cost: memory space, access time, and CPU time.
• Interpretability.

How to evaluate a prediction rule “objectively”, based on data?

Page 10

This is basically the idea of using test data (in contrast to training data, on which the prediction rule was based). Replicate the situations many times in the future (a few times might be unstable). Or set aside a portion (usually less than 1/3) of the training data before investigation (proper testing data).

How to choose this portion? Randomly? Why or why not?

Estimating prediction error via test data and cross-validation (CV)

Page 11

When the training data is limited, one often re-uses, when appropriate (“symmetry in data”), the same data units for training and testing via CV (cross-validation).

WARNING: when the number of data units is small relative to the complexity of the prediction rule, CV, as an estimate of the prediction error, can be far from the true prediction error and can’t be trusted.

CV error is NOT the prediction error!

Estimating prediction error via test data and cross-validation (CV)

Page 12

(Cross-validation). There are two versions of cross-validation:

1. V-fold CV: divide the training data into V equally sized groups. Cycle through the V groups, using V−1 groups for training and the left-out group for testing. Add up the errors from the V operations and average. (It is common to use 10-fold.)

2. Leave-one-out: a special case of V-fold CV where each group consists of only one data unit. This way of manipulating data is closely related to the jackknife (started in the late 40’s) for estimating bias and later variance (Tukey in the 50’s).

Estimating prediction error via test data and cross-validation (CV)
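The V-fold procedure above can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the lecture: the names `v_fold_cv`, `fit_ols`, and `predict_ols` are hypothetical, and squared-error (L2) loss is assumed.

```python
import numpy as np

def v_fold_cv(X, y, fit, predict, V=10, seed=0):
    """Average held-out squared error over V folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, V)
    errors = []
    for k in range(V):
        test = folds[k]
        # train on the other V-1 folds, test on the held-out fold
        train = np.concatenate([folds[j] for j in range(V) if j != k])
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))

# OLS as the prediction rule being evaluated
fit_ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ols = lambda beta, X: X @ beta
```

Setting V = n recovers leave-one-out CV.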

Page 13

2. Random-split CV: randomly select (1−α) × 100% of the data units for training and use the remaining α × 100% of the data units for testing. Repeat this process M times and average the resulting prediction errors. It is common to use α = 0.1 or 0.2, and M = 50, 100, or 200.

Fundamental assumption for cross-validation: the data units are exchangeable, or there is symmetry. For stock return data or economic index data, this assumption is violated.

Estimating prediction error via test data and cross-validation (CV)
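The random-split (Monte Carlo) CV scheme can be sketched similarly; `random_split_cv` and the OLS helpers are illustrative names, with squared-error loss assumed.

```python
import numpy as np

def random_split_cv(X, y, fit, predict, alpha=0.2, M=100, seed=0):
    """Hold out alpha*100% at random, repeat M times, average test errors."""
    n = len(y)
    n_test = max(1, int(round(alpha * n)))
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(M):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))

fit_ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ols = lambda beta, X: X @ beta
```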

Page 14

There is also an on-line mode of prediction where the accumulated prediction error is the performance metric: data come in sequentially, and for each new data point, the model/rule is updated, a prediction is cast, and the prediction is then compared with the next response.

The prediction errors are added up, or accumulated. This formulation is also called prequential analysis in the works of P. Dawid and is closely related to predictive coding and the predictive form of Minimum Description Length (MDL) of Rissanen.

This can also be viewed as one-way cross-validation, because the next observation (the future) is not involved in its own prediction.

Estimating prediction error via one-way cross-validation (CV)

Page 15

(One-Way Cross-validation). Assume data units form some natural order, as in a time series. Start with an initial predictor, which could take into account subject knowledge or information from previous data. At time t, use all the data up to this time to form a predictor, then use the observation at t + 1 to evaluate the prediction error. Add all the prediction errors and average.

This accumulated prediction error can be used to compare different prediction rules and select among competing models.

Estimating prediction error via one-way cross-validation (CV)
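The accumulated one-step-ahead error described above can be sketched as follows; `one_way_cv` and the OLS helpers are illustrative names, with squared-error loss and a burn-in of t0 observations assumed.

```python
import numpy as np

def one_way_cv(X, y, fit, predict, t0=10):
    """Accumulated one-step-ahead squared error for ordered data.

    At each time t >= t0, fit on all earlier observations X[:t], y[:t]
    and evaluate the squared error of the prediction for observation t."""
    errors = []
    for t in range(t0, len(y)):
        model = fit(X[:t], y[:t])
        pred = predict(model, X[t:t + 1])
        errors.append(float((y[t] - pred[0]) ** 2))
    return float(np.mean(errors))

fit_ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ols = lambda beta, X: X @ beta
```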

Page 16

Data perturbation schemes are now routinely employed to estimate bias, variance, and the sampling distribution of an estimator.

Jackknife:

… Quenouille (1949, 1956), Tukey (1958), Mosteller and Tukey (1977), Efron and Stein (1981), Wu (1986), Carlstein (1986), Wu (1990), …

Sub-sampling:

… Mahalanobis (1949), Hartigan (1969, 1970), Politis and Romano (1992, 1994), Bickel, Götze, and van Zwet (1997), …

Cross-validation:

… Allen (1974), Stone (1974, 1977), Hall (1983), Li (1985), …

Bootstrap:

… Efron (1979), Bickel and Freedman (1981), Beran (1984), …

Other data perturbation schemes

Page 17

Least Squares (LS) Method


Gauss’ method to find the trajectory of Ceres can be viewed as an ingenious iterative and approximate method to solve a Least Squares (LS) problem, based on “domain knowledge”. Gauss had more than two parameters, but for the sake of visualization, let us look at the LS function in 2-dim. Given n data units (x_{i1}, x_{i2}, y_i), i = 1, …, n, the Least Squares function in 2-dim to minimize is

f(β1, β2) = Σ_{i=1}^n (y_i − x_{i1}β1 − x_{i2}β2)²

Page 18

LS function surfaces with solutions for different sets of simulated data

LS function when p = 2

Page 19

(x_i, y_i), i = 1, …, n. We fit a line to the data cloud through LS, with x as the predictor and y as the response: find (a, b) such that Σ_i (y_i − a − b x_i)² is minimized. The LS line is not the symmetry line unless all data points fall exactly on a line. There are usually two LS lines, depending on which variable is the predictor.

Least Squares method with one predictor
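The one-predictor LS fit has a familiar closed form; `ls_line` is an illustrative name, not from the lecture.

```python
import numpy as np

def ls_line(x, y):
    """Closed-form LS line y ~ a + b*x: b = cov(x, y)/var(x), a = ybar - b*xbar."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    return a, b
```

The y-on-x slope b_{y|x} and the x-on-y slope b_{x|y} satisfy b_{y|x} · b_{x|y} = r², so the two LS lines coincide only when |r| = 1, i.e. when all points fall exactly on a line.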

Page 20

If x_i’s are midterm scores and y_i’s are final exam scores, they almost surely won’t fall on a straight line. Due to the departures of data points from a line, the low-performance group (those with scores below the average) will do better on average at the final, and the high-performance group (those with scores above the average) will do worse on average at the final.

The regression fallacy in this situation says that the improvement of the low-performance group is not due to variability of data, but to something else, such as students’ efforts.

Regression fallacy:

Page 21

Data: (x_i, y_i), i = 1, …, n, with x_i ∈ R^p and y_i ∈ R^1.

Design matrix X: n × p. Response vector Y: n × 1.

LS seeks β_ols ∈ R^p to minimize ||Y − Xβ||², which is a quadratic function of β.

Solution, when X is full rank: β_ols = (X′X)^{-1} X′Y

Least Squares (LS) method in general
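The closed-form solution can be checked numerically. A minimal numpy sketch on simulated data (not from the lecture); the normal-equations form and numpy's QR/SVD-based solver agree on a well-conditioned design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))            # full rank with probability 1
beta = np.array([2.0, -1.0, 0.5])
Y = X @ beta + 0.1 * rng.normal(size=n)

# Normal-equations solution (X'X)^{-1} X'Y
beta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferred equivalent (QR/SVD based)
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

assert np.allclose(beta_ne, beta_lstsq)
```

In practice `lstsq` (or QR) is preferred over forming X′X, which squares the condition number.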

Page 22

Y: indicator vector, 1 if “legal”, 0 otherwise. X: 10 randomly selected columns, or 10 columns selected based on absolute correlation.

For two-class problems like this, it is fine to use LS to find a fit to data, but not for three-class or other multi-class problems. This is because, with two classes, how the categorical class labels are coded into numbers does not matter: LS automatically adjusts to a scale or shift change.

Least Squares (LS) method with Enron data

Page 23

The plots on the right show fitted values vs observed Y and histograms of residuals: for randomly selected X in the left column and corr-selected X in the right column. Corr-selected X gives a better fit.

Least Squares (LS) method with Enron data

Page 24

For corr-selected X, we compare the LS fits with X as design matrix and sqrt(X) as design matrix. The plots on the right show fitted values vs observed Y and histograms of residuals: for X in the left column and sqrt(X) in the right column. Sqrt(X) gives a better fit.

Enron data: corr-selected X, sqrt transform.

Page 25

Fitted value vector: Ŷ = Xβ_ols = HY, where H = X(X′X)^{-1}X′ is the projection matrix onto the linear space spanned by the columns of X.

Residual vector: e = Y − Ŷ = Y − HY.

Y = Ŷ + e = Xβ_ols + e (this is a matrix identity, not the linear regression model), and by orthogonality ||Xβ_ols||² + ||e||² = ||Y||².

LS solution

Page 26

LS is very sensitive to outliers. Statistical leverage scores measure the “outlierness” of data points. For data point i, the ith diagonal element of H is its leverage score, a very useful concept from regression diagnostics and Robust Statistics.

LS solution
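Leverage scores can be computed without forming H explicitly. A sketch (the name `leverage_scores` is illustrative) using a thin QR decomposition, so H = QQ′ and h_ii is the squared norm of row i of Q:

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of H = X (X'X)^{-1} X', computed stably via thin QR:
    H = Q Q', so h_ii = ||row i of Q||^2."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)
```

The scores satisfy 0 ≤ h_ii ≤ 1 and sum to p, the trace of a rank-p projection.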

Page 27

When X is not full-rank, we can still take a basis of span(X) and project, but the solution β to the LS problem is no longer unique, even though the fitted vector (the projection of Y onto span(X)) still is. In the modern regime of p > n, X is never full-rank.

We can get a pseudo-inverse of X′X by using the Singular Value Decomposition (SVD). (A square matrix U is unitary iff U^{-1} = U′.)

Pseudo-inverse of X’X via SVD
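One concrete choice among the non-unique solutions is the minimum-norm one, obtained by inverting only the nonzero singular values; `ols_minimum_norm` is an illustrative name and the tolerance `tol` is an assumption:

```python
import numpy as np

def ols_minimum_norm(X, Y, tol=1e-10):
    """Minimum-norm LS solution via SVD; works when X is rank-deficient.

    With X = U diag(s) V', beta = V diag(1/s_j if s_j > tol else 0) U' Y,
    which equals pinv(X) @ Y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.array([1.0 / sj if sj > tol else 0.0 for sj in s])
    return Vt.T @ (s_inv * (U.T @ Y))
```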

Page 28

Randomized algorithms for matrices and data, Michael W. Mahoney (2011) (Sec. 4), http://arxiv.org/abs/1104.5557

Complexity of LS: O(p²n). Sample rows with p_i = ||U(i)||²/p, where U(i) is the ith row of the n × p matrix consisting of the first p left singular vectors in U. (Claim: ||U(i)||² is the leverage score for row i.) Then we can sample r = O(p log p / c²) rows to get c-close to the LS solution using the full data.

Sub-sampling of rows of X using leverage scores to reduce computation of LS when n>>p
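A sketch of this leveraging idea, in the spirit of Mahoney's Sec. 4 (the name `leveraged_ls` and the rescaling details are illustrative, not taken from the paper verbatim):

```python
import numpy as np

def leveraged_ls(X, Y, r, seed=0):
    """Approximate LS: sample r rows with probability proportional to
    leverage, rescale rows (importance weighting), solve the small LS."""
    n, _ = X.shape
    Q, _ = np.linalg.qr(X)
    probs = np.sum(Q ** 2, axis=1)
    probs /= probs.sum()                 # leverage scores, normalized to sum 1
    rng = np.random.default_rng(seed)
    rows = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / np.sqrt(r * probs[rows])   # keeps the subproblem unbiased
    return np.linalg.lstsq(w[:, None] * X[rows], w * Y[rows], rcond=None)[0]
```

For n ≫ p this trades a small approximation error for solving an r × p problem instead of an n × p one.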

Page 29

For i = 1, …, n, and a fixed design matrix X,

Y_i = x_i′β + ε_i,   ε_i iid (0, σ²).

What are we saying with this mathematical expression? Does it always hold, approximately? What could go wrong? Does it mean that we can freely set x_i’s values? Subject matter matters!

Linear regression model

Page 30

For i = 1, …, n, and a fixed design matrix X, consider Y_i = x_i′β + ε_i. What does it mean for our movie-fMRI data? We need to introduce Gabor wavelets…

Example: modeling of V1 voxel

Page 31

Modern Description of Hubel-Wiesel work: Early Visual Area V1


• Preprocessing an image:
• Gabor filters corresponding to particular spatial frequencies, locations, and orientations (Hubel and Wiesel, 1959, …)

Sparse representation after Gabor Filters, static or dynamic

Page 32

2D Gabor Features

Page 33

3D Gabor Features


• Data split into 1-second movie clips
• 3D Gabor filters applied to get features of a movie clip in 26K dimensions

Page 34

Movie-fMRI: linear encoding model for a voxel

For each voxel and the ith movie clip, suppose we know which p features are driving that voxel. We postulate a linear encoding (regression) model:

Y_i = β1 x_{i1} + β2 x_{i2} + … + βp x_{ip} + ε_i = X_i′β + ε_i,

where X_i = (x_{i1}, x_{i2}, …, x_{ip})′ is the feature vector of movie clip i, β = (β1, β2, …, βp)′ is the weight vector that combines feature strengths into the mean fMRI response, ε_i is the disturbance or noise term, and Y_i is the fMRI response.

Why linear? It is a first approximation to any functional form, and it worked for V1 single-neuron data.

Page 35

For i = 1, …, n, and a fixed design matrix X, Y_i = x_i′β + ε_i. If ε_i iid N(0, σ²), what is the maximum likelihood estimator of β? It is the same as LS. What is the maximum likelihood method in general?

Linear regression model

Page 36

Gaussian Linear regression model

Page 37

Gaussian Linear regression model

When X is full rank, OLS is unbiased.

Page 38

Inference in linear regression model

Page 39

Linear regression model: cov of OLS

Page 40

Linear regression model: unbiased estimator of noise variance

Page 41

Unbiased estimator of noise variance:

σ̂² = ||e||² / (n − p) = ||Y − Xβ_ols||² / (n − p),

which is an unbiased estimator of the noise variance.
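A short justification of unbiasedness, using the hat matrix H from the LS-solution slide (assuming the standard estimator σ̂² = ||e||²/(n − p), since the displayed formula did not survive extraction):

```latex
e = (I - H)Y = (I - H)\varepsilon \quad \text{(since } (I - H)X = 0\text{)},
\qquad
\mathbb{E}\|e\|^2 = \mathbb{E}\,\varepsilon'(I - H)\varepsilon
= \sigma^2\,\mathrm{tr}(I - H) = \sigma^2 (n - p),
```

so E[||e||²/(n − p)] = σ². Under Gaussian errors, moreover, ||e||²/σ² ~ χ²_{n−p}.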

Page 42

Generalized LS (GLS)

β_WLS = argmin_β Σ_{i=1}^n ((y_i − x_i′β) / σ_i)²
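With known σ_i, this weighted criterion reduces to OLS on rescaled data; a minimal sketch (the name `wls` is illustrative):

```python
import numpy as np

def wls(X, Y, sigma):
    """Minimize sum_i ((y_i - x_i' beta) / sigma_i)^2 by dividing each
    row of (X, Y) by sigma_i, then solving an ordinary LS problem."""
    return np.linalg.lstsq(X / sigma[:, None], Y / sigma, rcond=None)[0]
```

When all σ_i are equal, WLS reduces to OLS.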

Page 43

Gauss–Markov Theorem: OLS is BLUE, the best linear unbiased estimator (linear means linear in Y). Read the proof in Freedman (2005) (Statistical Models: Theory and Practice, page 62). Given any θ = linear combination of β, and any other unbiased linear estimator of it, the OLS plug-in has a smaller variance, and it is the only one with the minimum variance among all linear unbiased estimators.

BLUE property of OLS

Page 44

Inference for β: formally assess variability/uncertainty in the estimates, and obtain confidence intervals and hypothesis tests. (Review of the concepts of confidence intervals and hypothesis testing.) What for? Both concepts are used in a recent egg-yolk study.

Inference in linear regression

Page 45

Egg yolk or not?


Page 46

Paper abstract from Spence et al (2012) in Atherosclerosis


Page 47

Identify important associative factors/predictors of the response variable ABOVE the noise level inherent in the data.

Look at standardized estimates and use probability under an appropriate distribution. As discussed before, the standardized estimate is a universal currency across different problems for setting cut-offs, because of the universality of probability.

Goal of inference:

Page 48

Linear regression models have been used to adjust for other factors in a randomized study (under the Neyman–Rubin model; more in Madigan’s lectures), and they can be used to suggest possible important associative factors/predictors of the response variable.

For observational studies, the concept of design is still very important: how data are collected determines the strength of conclusions from an observational study.

See D. Rubin (2008), “For objective causal inference, design trumps analysis”, Annals of Applied Statistics, 808–840.

Inference is widely used in practice

Page 49

Taguchi’s idea of minimizing variance while aiming at a good mean is behind the success of the Japanese car industry in the 70’s and 80’s. Deming introduced ideas of design of experiments to Japan after the war, but Fisher’s original design-of-experiments idea from agriculture was to maximize the mean, or yield.

Design of experiment and Japanese car industry

Page 50

Good design reduces estimator variability

Page 51

The signs of estimated coefficients in linear regression are often interpreted: the corresponding predictors are seen as a “positive” influence on the response if the sign is positive, and a “negative” influence if the sign is negative. Why is it, or isn’t it, a good idea to interpret the signs?

Can we interpret the sign of an estimate?

Page 52

In normal linear regression, even if β_1 = 0, there is a 50% chance that its estimate is positive, and a 50% chance that it is negative. If two predictors are collinear (i.e., highly correlated) and one is in the “true” model, the sign of the other could be quite arbitrary due to collinearity. Stability could be used to weed out unstable signs before attempting interpretation.

Do not interpret signs with high dependence

Page 53

The OLS estimator is exactly multivariate normal with covariance σ²(X′X)^{-1} if the errors are Gaussian, or asymptotically normal if X′X/n converges to a positive definite matrix. The bootstrap is also a viable alternative.

More words on confidence interval construction

Page 54

Parametric bootstrap

Page 55

One often wants to find more than one “important” predictor by carrying out multiple tests on many predictors’ coefficients.

The problem of multiple testing: when more than one test is carried out, the real significance level is no longer controlled at the set level of 5% or 1%.

The problem of multiple testing

Page 56

Multiple testing


For a regression model (e.g. the egg-yolk study), we are interested in finding “important” predictors. (Another example is looking for genes well differentiated between cancer and non-cancer patients.) Multiple tests are carried out.

Apply t-tests to test the p β’s being zero at level α. If X is orthogonal, the probability that at least one test is rejected is very high even for small p’s such as 2, 3, 4 at the 5% level, EVEN IF all null hypotheses are true: 0.9975, 0.9998, 0.99994.

Page 57

Multiple testing


Bonferroni correction: carry out each test at level α/p to control the familywise type I error rate at α, via the union bound.

Reviews on multiple testing:

• J.P. Shaffer (1995), “Multiple Hypothesis Testing”, Annual Reviews. sci2s.ugr.es/keel/pdf/specific/articulo/Shaffer95MHT.pdf

• Dudoit et al. (2003) on multiple testing and microarrays. http://www.jstor.org/stable/3182872

Page 58

False Discovery Rate (FDR)


More recently, the false discovery rate has become popular in the multiple testing literature and practice:

FDR = number of false discoveries / number of discoveries

“Discovery” usually means “rejection of a null hypothesis”. For example, a gene is differentially expressed in cancer patients compared with non-cancer patients – a discovery!

Page 59

Control FDR at level alpha


Simes (1986) (also Benjamini and Hochberg, 1995):

1.  Order the observed p-values p(1), …, p(m).

2.  Find the max k such that p(k) ≤ α k/m.

3.  Reject all hypotheses j with p(j) ≤ p(k); if no such k exists, reject nothing.
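The three steps above can be sketched as follows; `bh_reject` is an illustrative name, returning a boolean mask of rejected hypotheses:

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the hypotheses with the k
    smallest p-values, where k = max{k : p_(k) <= alpha * k / m}."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # index of largest k satisfying the bound
        reject[order[:k + 1]] = True
    return reject
```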

Page 60

Control FDR


BH (1995) shows that the Simes–BH procedure controls the FDR in expectation at level α. This procedure is less conservative than the Bonferroni correction – it rejects more.

This is still an active research area, especially under dependent hypotheses, and many variations have appeared. For a short review, see Storey, J.D. (2010), “False discovery rates”, in International Encyclopedia of Statistical Science, Lovric, M. (editor). http://www.genomine.org/publications.html

Tutorial slides on FDR by Chris Genovese: www.stat.cmu.edu/~genovese/talks/hannover1-04.pdf

Page 61

Prediction on true test data (CV is not reliable if sample size is small relative to model complexity)

Residual plots (Weisberg and Cook’s book on applied regression)

Stability analysis: perturb the data in an appropriate way

Simulation studies (simulated from the fitted model)

Downstream outcome (what the model is used for)

MODEL ASSESSMENTS

Page 62

Standardize residuals and plot them against different predictors and the fitted values. Look for patterns and trends that suggest transformations of predictors to get a better fit, if p is not large.

When p is large, it is hard to see things in residual plots, but plotting residuals against fitted values can still be very helpful.

Residual plots to check normality of errors

Page 63

To reduce the influence of outliers, the L1 loss can be used instead of L2; this method is called LAD (Least Absolute Deviation). Other loss functions can be used as well, such as the Huber loss function, which combines the L1 and L2 losses: near zero it is L2, and far from zero it is L1. (L1 = absolute value loss function, L2 = quadratic loss function.)

M-estimation: fitting with a general loss function. Computation: fit via iteratively re-weighted LS.

Robust regression: M-estimation (Huber, 1964)
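The iteratively re-weighted LS computation for the Huber loss can be sketched as follows (the name `huber_irls` is illustrative; δ = 1.345 is a common default for unit error scale, and no scale estimation is included):

```python
import numpy as np

def huber_irls(X, Y, delta=1.345, n_iter=50):
    """M-estimation with Huber loss via iteratively re-weighted LS.

    Huber weights: w_i = 1 if |r_i| <= delta, else delta/|r_i|
    (quadratic loss near zero, absolute loss in the tails)."""
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # start from OLS
    for _ in range(n_iter):
        r = Y - X @ beta
        # max(|r|, delta) avoids division by zero and equals |r| where it matters
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.maximum(np.abs(r), delta))
        # weighted normal equations: (X' W X) beta = X' W Y
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * Y))
    return beta
```

A gross outlier ends up with weight δ/|r_i| ≪ 1, so its pull on the fit is bounded, unlike under plain LS.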

Page 64

Y’s are 0’s and 1’s: Challenger data, MISR data. Task: relate predictors x’s to Y. (1) Prediction (IT sector, banking, etc.). (2) Interpretation: what are the important predictors? They are suggestive of interventions, for causal inference later.

Classification: supervised learning