
  • ECS289: Scalable Machine Learning

    Cho-Jui Hsieh, UC Davis

    Sept 29, 2015

  • Outline

    Linear regression

    Ridge regression and Lasso

    Time complexity (closed form solution)

    Iterative Solvers

  • Regression

    Input: training data x_1, x_2, ..., x_n ∈ R^d and corresponding outputs y_1, y_2, ..., y_n ∈ R

    Training: compute a function f such that f(x_i) ≈ y_i for all i

    Prediction: given a testing sample x̃, predict the output as f(x̃)

    Examples:

    Income, number of children ⇒ Consumer spending
    Processes, memory ⇒ Power consumption
    Financial reports ⇒ Risk
    Atmospheric conditions ⇒ Precipitation

  • Linear Regression

    Assume f(·) is a linear function parameterized by w ∈ R^d: f(x) = w^T x

    Training: compute the model w ∈ R^d such that w^T x_i ≈ y_i for all i

    Prediction: given a testing sample x̃, the prediction value is w^T x̃

    How to find w?

    w* = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)²

  • Linear Regression: probability interpretation (*)

    Assume the data is generated from the probability model:

    y_i ∼ w^T x_i + ε_i,  ε_i ∼ N(0, 1)

    Maximum likelihood estimator:

    w* = argmax_w log P(y_1, ..., y_n | x_1, ..., x_n, w)
       = argmax_w ∑_{i=1}^n log P(y_i | x_i, w)
       = argmax_w ∑_{i=1}^n log( (1/√(2π)) e^{−(w^T x_i − y_i)²/2} )
       = argmax_w ∑_{i=1}^n −(1/2)(w^T x_i − y_i)² + constant
       = argmin_w ∑_{i=1}^n (w^T x_i − y_i)²

  • Linear Regression: written as a matrix form

    Linear regression: w* = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)²

    Matrix form: let X ∈ R^{n×d} be the matrix whose i-th row is x_i, and y = [y_1, ..., y_n]^T; then linear regression can be written as

    w* = argmin_{w ∈ R^d} ‖Xw − y‖_2²
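    A minimal sketch (not from the slides) of this matrix form in NumPy, on synthetic data; the sizes n, d and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                       # illustrative sizes
X = rng.standard_normal((n, d))     # row i of X is x_i
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# np.linalg.lstsq minimizes ||Xw - y||_2
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# prediction on a test sample x_tilde: f(x_tilde) = w^T x_tilde
x_tilde = rng.standard_normal(d)
print(w_star @ x_tilde)
```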

  • Data Structure

  • Dense Matrix vs Sparse Matrix

    Any matrix X ∈ R^{m×n} can be dense or sparse

    Dense matrix: most entries in X are nonzero (mn space)

    Sparse matrix: only a few entries in X are nonzero (O(nnz) space)

  • Dense Matrix Operations

    Let A ∈ R^{m×n}, B ∈ R^{m×n}, s ∈ R

    A + B, sA, A^T: mn operations

    Let A ∈ R^{m×n}, b ∈ R^{n×1}

    Ab: mn operations

  • Dense Matrix Operations

    Matrix-matrix multiplication: let A ∈ R^{m×k}, B ∈ R^{k×n}; what is the time complexity of computing AB?

  • Dense Matrix Operations

    Assume A, B ∈ R^{n×n}; what is the time complexity of computing AB?

    Naive implementation: O(n³)

    Theoretical best: O(n^{2.xxx}) (but slower than the naive implementation in practice)

    Best way: using BLAS (Basic Linear Algebra Subprograms)

  • Dense Matrix Operations

    BLAS matrix product: O(mnk) for computing AB where A ∈ R^{m×k}, B ∈ R^{k×n}

    Compute matrix product block by block to minimize cache miss rate

    Can be called from C, Fortran; can be used in MATLAB, R, Python, . . .

    Three levels of BLAS operations:
    1. Level 1: vector operations, e.g., y = αx + y
    2. Level 2: matrix-vector operations, e.g., y = αAx + βy
    3. Level 3: matrix-matrix operations, e.g., C = αAB + βC
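    A minimal sketch, assuming SciPy's low-level BLAS wrappers (scipy.linalg.blas), that calls one routine from each level; note that NumPy's `@` on large dense arrays also dispatches to a Level 3 BLAS routine internally.

```python
import numpy as np
from scipy.linalg import blas

alpha, beta = 2.0, 1.0
x, y = np.random.rand(4), np.random.rand(4)
A, B, C = np.random.rand(4, 4), np.random.rand(4, 4), np.random.rand(4, 4)

y1 = blas.daxpy(x, y, a=alpha)                  # Level 1: y = alpha*x + y
y2 = blas.dgemv(alpha, A, x, beta=beta, y=y)    # Level 2: y = alpha*A@x + beta*y
C3 = blas.dgemm(alpha, A, B, beta=beta, c=C)    # Level 3: C = alpha*A@B + beta*C
```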

  • BLAS (*)

  • Sparse Matrix Operations

    Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...

    CSC: three arrays for storing an m × n matrix with nnz nonzeros
    1. val (nnz real numbers): the values of the nonzero elements
    2. row_ind (nnz integers): the row indices corresponding to the values
    3. col_ptr (n + 1 integers): the list of value indices where each column starts

  • Sparse Matrix Operations

    Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...

    CSR: three arrays for storing an m × n matrix with nnz nonzeros
    1. val (nnz real numbers): the values of the nonzero elements
    2. col_ind (nnz integers): the column indices corresponding to the values
    3. row_ptr (m + 1 integers): the list of value indices where each row starts

  • Sparse Matrix Operations

    If A ∈ R^{m×n} (sparse), B ∈ R^{m×n} (sparse or dense), s ∈ R:

    A + B, sA, A^T: O(nnz) operations

    If A ∈ R^{m×n} (sparse), b ∈ R^{n×1}:

    Ab: O(nnz) operations

    If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (dense):

    AB: O(nnz(A) · n) operations (use sparse BLAS)

    If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (sparse):

    AB: O(nnz(A) nnz(B)/k) on average, O(nnz(A) · n) in the worst case

    The resulting matrix will be much denser
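    A minimal sketch (assuming scipy.sparse) of sparse operations whose costs scale with nnz rather than with m·n; the sizes and density are illustrative.

```python
import numpy as np
from scipy import sparse

m, n = 10_000, 5_000
A = sparse.random(m, n, density=1e-3, format='csr')   # nnz ≈ 1e-3 * m * n
b = np.random.rand(n)

y = A @ b              # sparse mat-vec: O(nnz(A)) operations
B = np.random.rand(n, 20)
Y = A @ B              # sparse * dense: roughly O(nnz(A) * 20) operations
C = A @ A.T            # sparse * sparse: result is typically much denser than A
print(A.nnz, C.nnz)
```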

  • Closed Form Solution

  • Solving Linear Regression

    Minimize the sum of squared errors J(w):

    J(w) = (1/2)‖Xw − y‖²
         = (1/2)(Xw − y)^T (Xw − y)
         = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y

    Derivative: ∂J(w)/∂w = X^T X w − X^T y

    Setting the derivative equal to zero gives the normal equation

    X^T X w* = X^T y

    Therefore, w* = (X^T X)^{-1} X^T y

    (but X^T X may not be invertible...)

  • Solving Linear Regression

    Normal equation: X^T X w* = X^T y

    If X^T X is invertible (typically when # samples > # features):

    w* = (X^T X)^{-1} X^T y

    If X^T X is low-rank (typically when # features > # samples): infinitely many solutions (why?)

    Least-norm solution: w* = X† y, where X† is the pseudo-inverse
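    A minimal sketch (not from the slides): when d > n, X^T X is singular and many w fit the training data exactly; np.linalg.pinv (and np.linalg.lstsq) return the least-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_pinv = np.linalg.pinv(X) @ y                    # least-norm solution X† y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # same solution

print(np.allclose(X @ w_pinv, y))   # fits the training data exactly
print(np.allclose(w_pinv, w_lstsq))
```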

  • Regularized Linear Regression

  • Overfitting

    Overfitting: the model has low training error but high prediction error.

    Using too many features can lead to overfitting

  • Avoid Overfitting by Controlling Model Complexity (*)

    In the following, we derive a bound for generalization (prediction) error

    Training & testing data are generated from the same distribution

    (x_i, y_i) ∼ D

    Generalization error (expectation of prediction error):

    R(f) := E_{(x,y)∼D}[(f(x) − y)²]

    Best model (minimizing prediction error): f* := argmin_{f ∈ F} R(f)

    Empirical error: error on the training data

    R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)²

    Our estimator: f̂ := argmin_{f ∈ F} R̂(f)

    F: a class of functions (e.g., all linear functions {f : f(x) = w^T x})

  • Avoid Overfitting by Controlling Model Complexity (*)

    Generalization error of f̂ :

    R(f̂) = R(f*) + (R̂(f̂) − R̂(f*)) + (R(f̂) − R̂(f̂)) + (R̂(f*) − R(f*))
         ≤ R(f*) + 0 + 2 max_{f ∈ F} |R̂(f) − R(f)|

    Overfitting is due to a large max_{f ∈ F} |R̂(f) − R(f)|

    How to make this term smaller?
    1. Have more samples (larger n)
    2. Make F a smaller set (regularization)

  • Avoid overfitting by Controlling Model Complexity (*)

    R(f̂)  (generalization error of the estimator)  ≤  R(f*) + 2 max_{f ∈ F} |R̂(f) − R(f)|

    Control F to be a subset of linear functions:
    1. Ridge regression: F = {f(x) = w^T x | ‖w‖_2 ≤ K}, where K is a constant
    2. Lasso: F = {f(x) = w^T x | ‖w‖_1 ≤ K}, where K is a constant

    If K is large ⇒ overfitting (max_{f ∈ F} |R̂(f) − R(f)| is large)

    If K is small ⇒ underfitting (R(f*) is large because F does not cover the best solution)

    Adding a constraint is equivalent to adding a regularization term:

    argmin_w ∑_{i=1}^n (w^T x_i − y_i)²  s.t. ‖w‖_2 ≤ K

    is equivalent to the following problem for some λ:

    argmin_w ∑_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

  • Regularized Linear Regression

    Regularized Linear Regression:

    argmin_w ‖Xw − y‖² + R(w)

    R(w): regularization term

    Ridge regression (ℓ2 regularization):

    argmin_w ‖Xw − y‖² + λ‖w‖²

    Lasso (ℓ1 regularization):

    argmin_w ‖Xw − y‖² + λ‖w‖_1

    Note that ‖w‖_1 = ∑_{i=1}^d |w_i|
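    A minimal sketch (an assumption, not from the slides) using scikit-learn's Ridge and Lasso estimators, whose objectives match the ones above up to constant scaling factors; the data and regularization strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:5] = rng.standard_normal(5)   # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)     # l2 regularization
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)     # l1 regularization

print(np.sum(np.abs(ridge.coef_) > 1e-8))   # ridge: typically all d coefficients nonzero
print(np.sum(np.abs(lasso.coef_) > 1e-8))   # lasso: typically a sparse coefficient vector
```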

  • Regularization

    Lasso: the solution is sparse, but there is no closed-form solution

  • Ridge Regression

    Ridge regression: argmin_{w ∈ R^d} J(w), where

    J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖²

    Closed-form solution: the optimal solution w* satisfies ∇J(w*) = 0:

    X^T X w* − X^T y + λ w* = 0
    (X^T X + λI) w* = X^T y

    Optimal solution: w* = (X^T X + λI)^{-1} X^T y

    The inverse always exists because X^T X + λI is positive definite

    What’s the computational complexity?

  • Time Complexity (Closed Form Solution)

  • Time Complexity (closed form solution of ridge regression)

    Goal: compute (X^T X + λI)^{-1}(X^T y)

    Step 1: compute A = X^T X + λI and b = X^T y
    What's the time complexity? O(nd² + nd) if X is dense

    Step 2: several options with O(d³) time complexity
    1. Compute A^{-1}, and then compute w* = A^{-1} b
    2. Directly solve the linear system Aw = b (calling the underlying LAPACK routine; e.g., Cholesky factorization)

    Which one is better?
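    A minimal sketch (not from the slides) of the two options for Step 2 in NumPy/SciPy; solving the linear system directly is generally preferred over explicitly forming A^{-1}, and all three calls below are O(d³).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 1000, 200, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = X.T @ X + lam * np.eye(d)      # Step 1: O(n d^2)
b = X.T @ y                        # Step 1: O(n d)

w1 = np.linalg.inv(A) @ b          # option 1: form A^{-1}, then multiply
w2 = np.linalg.solve(A, b)         # option 2: solve A w = b (LAPACK)
w3 = cho_solve(cho_factor(A), b)   # option 2: Cholesky, since A is positive definite

print(np.allclose(w1, w2), np.allclose(w2, w3))
```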

  • Time Complexity (closed form solution)

    Time complexity for Step 2: O(d³)

  • More on Linear System Solvers

    Can be done by a linear system solver: computing w such that Aw = y

    Different libraries can be used:

    Default LAPACK
    Intel Math Kernel Library (Intel MKL)
    AMD Core Math Library (ACML)

    Parallel numerical linear algebra packages are available today:

    ScaLAPACK (SCAlable Linear Algebra PACKage): linear algebra routines for distributed-memory machines
    PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra routines
    PLASMA (Parallel Linear Algebra Software for Multicore Architectures): combines state-of-the-art parallel algorithms and scheduling for optimized solutions of linear systems, eigenvalue problems, ...

    However, it still takes O(d³) time to solve Aw = y when A ∈ R^{d×d}

  • Time Complexity

    When X is dense:

    Closed-form solution requires O(nd² + d³) operations if X is dense

    Efficient if d is very small

    Runs forever when d > 100,000

    Typical case for big data applications:

    X ∈ R^{n×d} is sparse with large n and large d

    How can we solve the problem?

  • Iterative Solver (Coordinate Descent)

  • Optimization: iterative solver

    Iterative solver: generate a sequence of (improving) approximate solutions for a problem.

    w^(0) (initial point) → w^(1) → w^(2) → w^(3) → ...

  • Convergence of the iterative algorithm

    Convergence properties:

    Will the algorithm converge to a point?

    Will it converge to optimal solution(s)?

    What’s the convergence rate (how fast will the sequence converge)?

    We will only state the existing convergence properties in this class; we will focus on:

    computational complexity per iteration

  • Coordinate Descent

    Solve the optimization problem:

    argmin_w J(w)

    Update one variable at a time

    Obtain a model with reasonable performance after a few iterations

    Randomized coordinate descent: pick a random coordinate to update at each iteration

  • Coordinate Descent

    Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
    Output: solution w
    1: t = 0
    2: while not converged do
    3:   Randomly choose a variable w_j
    4:   Compute the optimal update on w_j by solving
           δ* = argmin_δ J(w + δ e_j) − J(w)
    5:   Update w_j: w_j ← w_j + δ*
    6:   t ← t + 1
    7: end while

  • Coordinate Descent

    Q: What is the exact CD rule for Ridge Regression?

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )
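    A minimal sketch (not from the slides) that checks this rule numerically: for the ridge objective J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖², the δ* above minimizes J along coordinate j.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def J(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

j = 3                                           # coordinate to update
delta_star = -(X[:, j] @ (X @ w - y) + lam * w[j]) / (lam + np.sum(X[:, j] ** 2))

# J restricted to coordinate j is a 1-D quadratic; delta_star should beat nearby deltas
e_j = np.zeros(d); e_j[j] = 1.0
deltas = delta_star + np.linspace(-1, 1, 5)
print(min(J(w + t * e_j) for t in deltas) >= J(w + delta_star * e_j) - 1e-12)
```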

  • Coordinate Descent (convergence)

    If J(w) is strongly convex, randomized coordinate descent converges to the global optimum with a global linear convergence rate.

    For ridge regression, J(w) is strongly convex because

    ∇²J(w) = X^T X + λI ⪰ λI ≻ 0

    We will show more details in the next class

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    What’s the computational complexity?

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    A naive approach, for each coordinate update (w_j):
    1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
    2. Compute h_j := ∑_{i=1}^n X_ij²: O(n_j) operations
    3. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    4. w_j ← w_j + δ*

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    Precompute h_j := ∑_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations

    For each coordinate update (w_j):
    1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
    2. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    3. w_j ← w_j + δ*

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    Precompute h_j := ∑_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations

    Precompute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations

    For each coordinate update (w_j):
    1. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    2. For all i with X_ij ≠ 0, update r_i ← r_i + δ* X_ij: O(n_j) operations
    3. w_j ← w_j + δ*
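    A minimal sketch (not from the slides) of this residual-maintaining coordinate descent for ridge, J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖², using a CSC matrix so that column j's n_j nonzeros can be scanned directly; sizes and hyperparameters are illustrative.

```python
import numpy as np
from scipy import sparse

def ridge_cd(X_csc, y, lam, n_epochs=50, seed=0):
    n, d = X_csc.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    h = np.asarray(X_csc.multiply(X_csc).sum(axis=0)).ravel()  # h_j = sum_i X_ij^2
    r = X_csc @ w - y                                          # residual r_i = x_i^T w - y_i
    for _ in range(n_epochs):
        for j in rng.permutation(d):                           # randomized coordinate order
            start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
            rows, vals = X_csc.indices[start:end], X_csc.data[start:end]
            delta = -(vals @ r[rows] + lam * w[j]) / (lam + h[j])  # O(n_j)
            r[rows] += delta * vals                            # maintain residual, O(n_j)
            w[j] += delta
    return w

# usage on random sparse data, compared with the closed-form solution
X = sparse.random(1000, 200, density=0.05, format='csc', random_state=0)
y = np.random.default_rng(0).standard_normal(1000)
w_cd = ridge_cd(X, y, lam=1.0)
A = (X.T @ X).toarray() + 1.0 * np.eye(200)
w_closed = np.linalg.solve(A, X.T @ y)
print(np.linalg.norm(w_cd - w_closed))   # should shrink as n_epochs grows
```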

  • Time Complexity

    On average, O(n̄) operations per coordinate update, where n̄ = nnz(X)/d

    T outer iterations, each performing d coordinate updates: T × O(nnz(X)) operations

    Closed-form solution: O(nnz(X) d + d³) operations

    Can be easily extended to solve the Lasso problem (see the sketch below)
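    A minimal sketch (an assumption, not from the slides) of that Lasso extension, taking the objective J(w) = (1/2)‖Xw − y‖² + λ‖w‖_1 and dense X for brevity; the only change from the ridge update is a soft-thresholding step in the coordinate-wise minimizer.

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.| : sign(z) * max(|z| - t, 0)
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_epochs=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    h = (X ** 2).sum(axis=0)            # h_j = sum_i X_ij^2
    r = X @ w - y                       # residual r_i = x_i^T w - y_i
    for _ in range(n_epochs):
        for j in rng.permutation(d):
            # minimize J over w_j with the other coordinates fixed
            w_new = soft_threshold(h[j] * w[j] - X[:, j] @ r, lam) / h[j]
            r += (w_new - w[j]) * X[:, j]   # maintain the residual
            w[j] = w_new
    return w

# usage: with a large enough lam, many coefficients come out exactly zero
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
w = lasso_cd(X, y, lam=10.0)
print(np.sum(w != 0))   # number of nonzero coefficients
```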

  • Other Iterative Solvers

    Other iterative solvers for ridge regression:

    Gradient Descent

    Stochastic Gradient Descent

    . . . . . .

  • Coming up

    Next class: intro to optimization

    Questions?