ECS289: Scalable Machine Learning
Cho-Jui Hsieh
UC Davis
Sept 29, 2015
Outline
Linear regression
Ridge regression and Lasso
Time complexity (closed form solution)
Iterative Solvers
Regression
Input: training data x_1, x_2, ..., x_n ∈ R^d and corresponding outputs y_1, y_2, ..., y_n ∈ R
Training: compute a function f such that f(x_i) ≈ y_i for all i
Prediction: given a testing sample x̃, predict the output as f(x̃)
Examples:
Income, number of children ⇒ Consumer spending
Processes, memory ⇒ Power consumption
Financial reports ⇒ Risk
Atmospheric conditions ⇒ Precipitation
Linear Regression
Assume f(·) is a linear function parameterized by w ∈ R^d: f(x) = w^T x
Training: compute the model w ∈ R^d such that w^T x_i ≈ y_i for all i
Prediction: given a testing sample x̃, the prediction value is w^T x̃
How to find w?

w* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)²
Linear Regression: probability interpretation (*)
Assume the data is generated from the probability model:

y_i = w^T x_i + ε_i,   ε_i ∼ N(0, 1)

Maximum likelihood estimator:

w* = argmax_w log P(y_1, ..., y_n | x_1, ..., x_n, w)
   = argmax_w Σ_{i=1}^n log P(y_i | x_i, w)
   = argmax_w Σ_{i=1}^n log( (1/√(2π)) e^{−(w^T x_i − y_i)²/2} )
   = argmax_w Σ_{i=1}^n −(1/2)(w^T x_i − y_i)² + constant
   = argmin_w Σ_{i=1}^n (w^T x_i − y_i)²
Linear Regression: written in matrix form

Linear regression: w* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)²
Matrix form: let X ∈ R^{n×d} be the matrix whose i-th row is x_i, and let y = [y_1, ..., y_n]^T; then linear regression can be written as

w* = argmin_{w ∈ R^d} ‖Xw − y‖²₂
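As a quick sanity check, here is a minimal NumPy sketch (random data, chosen only for illustration) verifying that the matrix form ‖Xw − y‖² agrees with the sum-of-squares form above.

```python
import numpy as np

# Check that the sum-of-squares objective equals its matrix form.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))   # i-th row of X is x_i
y = rng.standard_normal(n)
w = rng.standard_normal(d)

sum_form = sum((w @ X[i] - y[i]) ** 2 for i in range(n))
matrix_form = np.linalg.norm(X @ w - y) ** 2
print(np.isclose(sum_form, matrix_form))  # True
```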
Data Structure
Dense Matrix vs Sparse Matrix
Any matrix X ∈ R^{m×n} can be stored as dense or sparse
Dense matrix: most entries of X are nonzero (mn space)
Sparse matrix: only a few entries of X are nonzero (O(nnz) space)
Dense Matrix Operations
Let A ∈ R^{m×n}, B ∈ R^{m×n}, s ∈ R
A + B, sA, A^T: mn operations
Let A ∈ R^{m×n}, b ∈ R^{n×1}
Ab: mn operations
Dense Matrix Operations
Matrix-matrix multiplication: let A ∈ R^{m×k}, B ∈ R^{k×n}; what is the time complexity of computing AB?
Dense Matrix Operations
Assume A, B ∈ R^{n×n}; what is the time complexity of computing AB?
Naive implementation: O(n³)
Theoretical best: O(n^{2.37...}) (but slower than the naive implementation in practice)
Best way: using BLAS (Basic Linear Algebra Subprograms)
Dense Matrix Operations
BLAS matrix product: O(mnk) for computing AB, where A ∈ R^{m×k}, B ∈ R^{k×n}
Compute matrix product block by block to minimize cache miss rate
Can be called from C, Fortran; can be used in MATLAB, R, Python, . . .
Three levels of BLAS operations:
1 Level 1: vector operations, e.g., y = αx + y
2 Level 2: matrix-vector operations, e.g., y = αAx + βy
3 Level 3: matrix-matrix operations, e.g., C = αAB + βC
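As an illustration, the three levels can be called directly from Python through scipy.linalg.blas. This is only a small sketch with random matrices; in practice np.dot / the @ operator already dispatch to the same level-3 routine.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
x = rng.standard_normal(4)
u, v = rng.standard_normal(5), rng.standard_normal(5)

# Level 1 (vector): v <- 2*u + v
v_new = blas.daxpy(u, v.copy(), a=2.0)

# Level 2 (matrix-vector): y <- 1.0 * A @ x
y = blas.dgemv(1.0, A, x)

# Level 3 (matrix-matrix): C <- 1.0 * A @ B
C = blas.dgemm(1.0, A, B)
print(np.allclose(C, A @ B))  # True; @ also uses a BLAS gemm under the hood
```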
BLAS (*)
Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSC: three arrays for storing an m × n matrix with nnz nonzeros
1 val (nnz real numbers): the value of each nonzero element
2 row_ind (nnz integers): the row index corresponding to each value
3 col_ptr (n + 1 integers): the index in val where each column starts
Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...
CSR: three arrays for storing an m × n matrix with nnz nonzeros
1 val (nnz real numbers): the value of each nonzero element
2 col_ind (nnz integers): the column index corresponding to each value
3 row_ptr (m + 1 integers): the index in val where each row starts
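A small sketch of the two layouts using scipy.sparse; the attribute names data, indices, and indptr correspond to val, row_ind/col_ind, and col_ptr/row_ptr above. The 3 × 3 example matrix is arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

A = np.array([[1., 0., 2.],
              [0., 0., 3.],
              [4., 5., 0.]])

A_csr = csr_matrix(A)
print(A_csr.data)     # [1. 2. 3. 4. 5.]   (val)
print(A_csr.indices)  # [0 2 2 0 1]        (col_ind)
print(A_csr.indptr)   # [0 2 3 5]          (row_ptr, m+1 entries)

A_csc = csc_matrix(A)
print(A_csc.data)     # [1. 4. 5. 2. 3.]   (val)
print(A_csc.indices)  # [0 2 2 0 1]        (row_ind)
print(A_csc.indptr)   # [0 2 3 5]          (col_ptr, n+1 entries)
```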
Sparse Matrix Operations
If A ∈ R^{m×n} (sparse), B ∈ R^{m×n} (sparse or dense), s ∈ R:
A + B, sA, A^T: O(nnz) operations
If A ∈ R^{m×n} (sparse), b ∈ R^{n×1}:
Ab: O(nnz) operations
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (dense):
AB: O(nnz(A) · n) operations (use sparse BLAS)
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (sparse):
AB: O(nnz(A) · nnz(B)/k) operations on average, O(nnz(A) · n) in the worst case
The resulting matrix will be much denser
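A rough illustration of the last point with scipy.sparse (the 1000 × 1000 size and 1% density are arbitrary choices): the product of two sparse matrices typically has far more nonzeros than either factor.

```python
import scipy.sparse as sp

# Two random sparse matrices at 1% density.
A = sp.random(1000, 1000, density=0.01, format='csr', random_state=0)
B = sp.random(1000, 1000, density=0.01, format='csr', random_state=1)
C = A @ B
# C has roughly nnz(A)*nnz(B)/k nonzeros, far more than A or B.
print(A.nnz, B.nnz, C.nnz)
```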
Closed Form Solution
Solving Linear Regression
Minimize the sum of squared errors J(w):

J(w) = (1/2)‖Xw − y‖²
     = (1/2)(Xw − y)^T (Xw − y)
     = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y

Derivative: ∂J(w)/∂w = X^T X w − X^T y
Setting the derivative equal to zero gives the normal equation

X^T X w* = X^T y

Therefore, w* = (X^T X)^{-1} X^T y
(but X^T X may not be invertible. . . )
Solving Linear Regression
Normal equation: X^T X w* = X^T y
If X^T X is invertible (typically when # samples > # features):

w* = (X^T X)^{-1} X^T y

If X^T X is low-rank (typically when # features > # samples): there are infinitely many solutions (why?)
Least-norm solution: w* = X† y, where X† is the pseudo-inverse of X
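A minimal NumPy sketch of the underdetermined case (random data for illustration only): among the infinitely many w with Xw = y, the pseudo-inverse picks the one with the smallest norm, and np.linalg.lstsq returns the same minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                    # more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_pinv = np.linalg.pinv(X) @ y                   # least-norm solution X^+ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # LAPACK least-squares solver

print(np.allclose(X @ w_pinv, y))                # True: zero training error
print(np.allclose(w_pinv, w_lstsq))              # True: both return the minimum-norm w
```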
Regularized Linear Regression
Overfitting
Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting
Avoid Overfitting by Controlling Model Complexity (*)
In the following, we derive a bound on the generalization (prediction) error.
Training and testing data are generated from the same distribution: (x_i, y_i) ∼ D
Generalization error (expected prediction error):

R(f) := E_{(x,y)∼D}[(f(x) − y)²]

Best model (minimizing prediction error): f* := argmin_{f∈F} R(f)
Empirical error: error on the training data

R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)²

Our estimator: f̂ := argmin_{f∈F} R̂(f)
F: a class of functions (e.g., all linear functions {f : f(x) = w^T x})
Avoid Overfitting by Controlling Model Complexity (*)
Generalization error of f̂:

R(f̂) = R(f*) + (R̂(f̂) − R̂(f*)) + (R(f̂) − R̂(f̂)) + (R̂(f*) − R(f*))
     ≤ R(f*) + 0 + 2 max_{f∈F} |R̂(f) − R(f)|

(The second term is ≤ 0 because f̂ minimizes the empirical error R̂.)
Overfitting is due to a large max_{f∈F} |R̂(f) − R(f)|
How can we make this term smaller?
1 Have more samples (larger n)
2 Make F a smaller set (regularization)
Avoid Overfitting by Controlling Model Complexity (*)

R(f̂) (the generalization error of the estimator) ≤ R(f*) + 2 max_{f∈F} |R̂(f) − R(f)|

Control F to be a subset of linear functions:
1 Ridge regression: F = {f(x) = w^T x | ‖w‖₂ ≤ K}, where K is a constant
2 Lasso: F = {f(x) = w^T x | ‖w‖₁ ≤ K}, where K is a constant

If K is large ⇒ overfitting (max_{f∈F} |R̂(f) − R(f)| is large)
If K is small ⇒ underfitting (R(f*) is large because F does not cover the best solution)
Adding a constraint is equivalent to adding a regularization term:

argmin_w Σ_{i=1}^n (w^T x_i − y_i)²  s.t. ‖w‖₂ ≤ K

is equivalent to the following problem for some λ:

argmin_w Σ_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²
Regularized Linear Regression
Regularized Linear Regression:
argmin_w ‖Xw − y‖² + R(w)

R(w): regularization term
Ridge Regression (ℓ2 regularization):

argmin_w ‖Xw − y‖² + λ‖w‖²

Lasso (ℓ1 regularization):

argmin_w ‖Xw − y‖² + λ‖w‖₁

Note that ‖w‖₁ = Σ_{i=1}^d |w_i|
Regularization
Lasso: the solution is sparse, but there is no closed-form solution
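For intuition, a quick sketch with scikit-learn (random data; alpha = 0.5 is an arbitrary choice, and sklearn scales the loss slightly differently from the formulas above): the Lasso solution has many exactly-zero coefficients, while the ridge solution does not.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
# Only the first 3 features matter; the rest are noise.
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print(np.sum(lasso.coef_ == 0))  # many coefficients are exactly zero
print(np.sum(ridge.coef_ == 0))  # typically none are exactly zero
```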
Ridge Regression
Ridge regression: argmin_{w ∈ R^d} J(w), where

J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖²

Closed form solution: the optimal solution w* satisfies ∇J(w*) = 0:

X^T X w* − X^T y + λw* = 0
(X^T X + λI) w* = X^T y

Optimal solution: w* = (X^T X + λI)^{-1} X^T y
The inverse always exists because X^T X + λI is positive definite
What's the computational complexity?
Time Complexity (Closed Form Solution)
Time Complexity (closed form solution of ridge regression)
Goal: compute (XTX + λI )−1(XTy)
Step 1: compute A = X^T X + λI and b = X^T y
What's the time complexity? O(nd² + nd) if X is dense
Step 2: several options, each with O(d³) time complexity
1 Compute A^{-1}, and then compute w* = A^{-1} b
2 Directly solve the linear system Aw = b (calling an underlying LAPACK routine, e.g., a Cholesky factorization)
Which one is better?
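A sketch of both options in NumPy/SciPy (dense random X; the values of n, d, and λ are arbitrary). In practice, solving the linear system, e.g. via a Cholesky factorization, is usually preferred over forming the explicit inverse: both cost O(d³), but the solver does fewer operations and is numerically more stable.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 500, 50, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = X.T @ X + lam * np.eye(d)            # Step 1: O(n d^2)
b = X.T @ y                              #         O(n d)

w_inv = np.linalg.inv(A) @ b             # Option 1: explicit inverse, O(d^3)
w_solve = np.linalg.solve(A, b)          # Option 2: LAPACK linear solver, O(d^3)
w_chol = cho_solve(cho_factor(A), b)     # Option 2': Cholesky, since A is SPD

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_chol))  # True True
```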
Time Complexity (closed form solution)
Time complexity for step 2: O(d³)
More on Linear System Solvers
Can be done by a linear system solver: compute w such that Aw = y
Different libraries can be used:
Default LAPACK
Intel Math Kernel Library (Intel MKL)
AMD Core Math Library (ACML)
Parallel numerical linear algebra packages are available today:
ScaLAPACK (SCAlable Linear Algebra PACKage): linear algebra routines for distributed-memory machines
PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra routines
PLASMA (Parallel Linear Algebra Software for Multicore Architectures): combines state-of-the-art parallel algorithms and scheduling for optimized solutions of linear systems, eigenvalue problems, ...
However, still O(d3) time complexity to solve Aw = y when A ∈ Rd×d
Time Complexity
When X is dense:
Closed form solution requires O(nd² + d³) operations
Efficient if d is very small
Runs forever when d > 100,000
Typical case for big data applications:
X ∈ R^{n×d} is sparse with large n and large d
How can we solve the problem?
Iterative Solver (Coordinate Descent)
Optimization: iterative solver
Iterative solver: generate a sequence of (improving) approximate solutions for a problem.

w⁰ (initial point) → w¹ → w² → w³ → ...
Convergence of the iterative algorithm
Convergence properties:
Will the algorithm converge to a point?
Will it converge to optimal solution(s)?
What’s the convergence rate (how fast will the sequence converge)?
In this class we will only state existing convergence properties; our focus will be on:
computational complexity per iteration
Coordinate Descent
Solve the optimization problem:
argmin_w J(w)

Update one variable at a time
Obtains a model with reasonable performance after only a few iterations
Randomized coordinate descent: pick a random coordinate to update at each iteration
Coordinate Descent
Input: X ∈ R^{n×d}, y ∈ R^n, initial w^(0)
Output: solution w
1: t = 0
2: while not converged do
3:   Randomly choose a variable w_j
4:   Compute the optimal update for w_j by solving
       δ* = argmin_δ J(w + δ e_j) − J(w)
5:   Update w_j: w_j ← w_j + δ*
6:   t ← t + 1
7: end while

Q: What is the exact CD rule for Ridge Regression?

δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij² )
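A small numerical check of this rule (random data; λ = 0.1 and the coordinate j = 3 are arbitrary), using the J(w) with the 1/2 factors from the ridge regression slide: the closed-form δ* matches a brute-force one-dimensional minimization along coordinate j.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, d, lam, j = 50, 10, 0.1, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def J(w):
    # Ridge objective with the 1/2 factors used on the ridge regression slide.
    return 0.5 * np.linalg.norm(X @ w - y) ** 2 + 0.5 * lam * (w @ w)

# Closed-form coordinate update from the slide.
delta_closed = -(X[:, j] @ (X @ w - y) + lam * w[j]) / (lam + X[:, j] @ X[:, j])

# Brute-force 1-D minimization of J along coordinate j.
e_j = np.zeros(d); e_j[j] = 1.0
delta_brute = minimize_scalar(lambda t: J(w + t * e_j)).x
print(np.isclose(delta_closed, delta_brute))  # True
```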
Coordinate Descent (convergence)
If J(w) is strongly convex, randomized coordinate descent converges to the global optimum with a global linear convergence rate.
For ridge regression, J(w) is strongly convex because

∇²J(w) = X^T X + λI ⪰ λI ≻ 0

We will show more details in the next class.
Time Complexity
For updating w_j, we need to compute

δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij² )
What’s the computational complexity?
Time Complexity
For updating w_j, we need to compute

δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij² )

Assume X ∈ R^{n×d} is a sparse matrix, and each column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
A naive approach:
For each coordinate update (w_j):
1 Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
2 Compute h_j := Σ_{i=1}^n X_ij²: O(n_j) operations
3 Compute δ* = −(Σ_{i=1}^n r_i X_ij + λ w_j) / (λ + h_j): O(n_j) operations
4 w_j ← w_j + δ*
Time Complexity
For updating w_j, we need to compute

δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij² )

Assume X ∈ R^{n×d} is a sparse matrix, and each column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
A better approach:
Precompute h_j := Σ_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations
For each coordinate update (w_j):
1 Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
2 Compute δ* = −(Σ_{i=1}^n r_i X_ij + λ w_j) / (λ + h_j): O(n_j) operations
3 w_j ← w_j + δ*
Time Complexity
For updating w_j, we need to compute

δ* = −( Σ_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_ij² )

Assume X ∈ R^{n×d} is a sparse matrix, and each column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
Precompute h_j := Σ_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations
Precompute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
For each coordinate update (w_j):
1 Compute δ* = −(Σ_{i=1}^n r_i X_ij + λ w_j) / (λ + h_j): O(n_j) operations
2 For all i with X_ij ≠ 0, update r_i ← r_i + δ* X_ij: O(n_j) operations
3 w_j ← w_j + δ*
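A Python sketch of this scheme for ridge regression (the function name cd_ridge, the number of updates, the fixed random seed, and the test data are our own choices, not from the slides): h_j and the residuals r_i are precomputed once, and each coordinate update touches only the nonzeros of column j, using the CSC layout described earlier.

```python
import numpy as np
import scipy.sparse as sp

def cd_ridge(X, y, lam, n_updates):
    """Randomized coordinate descent for ridge regression with residual updates."""
    X = sp.csc_matrix(X)                   # CSC: cheap access to one column at a time
    n, d = X.shape
    w = np.zeros(d)
    h = np.asarray(X.multiply(X).sum(axis=0)).ravel()  # h_j = sum_i X_ij^2
    r = -y.astype(float)                   # r_i = x_i^T w - y_i, with w = 0
    rng = np.random.default_rng(0)
    for _ in range(n_updates):
        j = int(rng.integers(d))           # pick a random coordinate
        col = X.getcol(j)
        rows, vals = col.indices, col.data # the n_j nonzeros of column j
        delta = -(vals @ r[rows] + lam * w[j]) / (lam + h[j])  # O(n_j)
        r[rows] += delta * vals            # maintain residuals in O(n_j)
        w[j] += delta
    return w

# Usage: compare against the closed-form ridge solution on a small dense problem.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
y = rng.standard_normal(200)
lam = 1.0
w_cd = cd_ridge(X, y, lam, n_updates=50 * 30)
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(30), X.T @ y)
print(np.allclose(w_cd, w_exact, atol=1e-3))  # True (up to the iterative tolerance)
```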
Time Complexity
On average, O(n̄) operations per coordinate update, where n̄ = nnz(X)/d
With T outer iterations, each consisting of d coordinate updates:
T × O(nnz(X)) operations
Closed form solution: O(nnz(X) · d + d³) operations
Can be easily extended to solve the Lasso problem
Other Iterative Solvers
Other iterative solvers for ridge regression:
Gradient Descent
Stochastic Gradient Descent
. . . . . .
Coming up
Next class: intro to optimization
Questions?