
  • ECS289: Scalable Machine Learning

    Cho-Jui Hsieh, UC Davis

    Sept 29, 2015

  • Outline

    Linear regression

    Ridge regression and Lasso

    Time complexity (closed form solution)

    Iterative Solvers

  • Regression

    Input: training data x_1, x_2, ..., x_n ∈ R^d and corresponding outputs y_1, y_2, ..., y_n ∈ R

    Training: compute a function f such that f(x_i) ≈ y_i for all i

    Prediction: given a testing sample x̃, predict the output as f(x̃)

    Examples:

    Income, number of children ⇒ Consumer spending
    Processes, memory ⇒ Power consumption
    Financial reports ⇒ Risk
    Atmospheric conditions ⇒ Precipitation

  • Linear Regression

    Assume f(·) is a linear function parameterized by w ∈ R^d: f(x) = w^T x

    Training: compute the model w ∈ R^d such that w^T x_i ≈ y_i for all i

    Prediction: given a testing sample x̃, the prediction value is w^T x̃

    How to find w?

    w* = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)²

  • Linear Regression: probability interpretation (*)

    Assume the data is generated from the probability model:

    y_i ∼ w^T x_i + ε_i,  ε_i ∼ N(0, 1)

    Maximum likelihood estimator:

    w* = argmax_w log P(y_1, ..., y_n | x_1, ..., x_n, w)
       = argmax_w ∑_{i=1}^n log P(y_i | x_i, w)
       = argmax_w ∑_{i=1}^n log( (1/√(2π)) e^{−(w^T x_i − y_i)²/2} )
       = argmax_w ∑_{i=1}^n −(1/2)(w^T x_i − y_i)² + constant
       = argmin_w ∑_{i=1}^n (w^T x_i − y_i)²

  • Linear Regression: written as a matrix form

    Linear regression: w* = argmin_{w ∈ R^d} ∑_{i=1}^n (w^T x_i − y_i)²

    Matrix form: let X ∈ R^{n×d} be the matrix whose i-th row is x_i, and y = [y_1, ..., y_n]^T; then linear regression can be written as

    w* = argmin_{w ∈ R^d} ‖Xw − y‖_2²
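    A minimal sketch (not from the slides) of this matrix form in NumPy, on synthetic data; the sizes n, d and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                       # illustrative sizes
X = rng.standard_normal((n, d))     # row i of X is x_i
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# np.linalg.lstsq minimizes ||Xw - y||_2
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# prediction on a test sample x_tilde: f(x_tilde) = w^T x_tilde
x_tilde = rng.standard_normal(d)
print(w_star @ x_tilde)
```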

  • Data Structure

  • Dense Matrix vs Sparse Matrix

    Any matrix X ∈ R^{m×n} can be dense or sparse

    Dense matrix: most entries in X are nonzero (mn space)

    Sparse matrix: only a few entries in X are nonzero (O(nnz) space)

  • Dense Matrix Operations

    Let A ∈ R^{m×n}, B ∈ R^{m×n}, s ∈ R

    A + B, sA, A^T: mn operations

    Let A ∈ R^{m×n}, b ∈ R^{n×1}

    Ab: mn operations

  • Dense Matrix Operations

    Matrix-matrix multiplication: let A ∈ R^{m×k}, B ∈ R^{k×n}; what is the time complexity of computing AB?

  • Dense Matrix Operations

    Assume A, B ∈ R^{n×n}; what is the time complexity of computing AB?

    Naive implementation: O(n³)

    Theoretical best: O(n^{2.xxx}) (but slower than the naive implementation in practice)

    Best way: using BLAS (Basic Linear Algebra Subprograms)

  • Dense Matrix Operations

    BLAS matrix product: O(mnk) for computing AB where A ∈ R^{m×k}, B ∈ R^{k×n}

    Compute matrix product block by block to minimize cache miss rate

    Can be called from C, Fortran; can be used in MATLAB, R, Python, . . .

    Three levels of BLAS operations:
    1. Level 1: vector operations, e.g., y = αx + y
    2. Level 2: matrix-vector operations, e.g., y = αAx + βy
    3. Level 3: matrix-matrix operations, e.g., C = αAB + βC
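    A minimal sketch, assuming SciPy's low-level BLAS wrappers (scipy.linalg.blas), that calls one routine from each level; note that NumPy's `@` on large dense arrays also dispatches to a Level 3 BLAS routine internally.

```python
import numpy as np
from scipy.linalg import blas

alpha, beta = 2.0, 1.0
x, y = np.random.rand(4), np.random.rand(4)
A, B, C = np.random.rand(4, 4), np.random.rand(4, 4), np.random.rand(4, 4)

y1 = blas.daxpy(x, y, a=alpha)                  # Level 1: y = alpha*x + y
y2 = blas.dgemv(alpha, A, x, beta=beta, y=y)    # Level 2: y = alpha*A@x + beta*y
C3 = blas.dgemm(alpha, A, B, beta=beta, c=C)    # Level 3: C = alpha*A@B + beta*C
```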

  • BLAS (*)

  • Sparse Matrix Operations

    Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...

    CSC: three arrays for storing an m × n matrix with nnz nonzeros
    1. val (nnz real numbers): the values of the nonzero elements
    2. row_ind (nnz integers): the row indices corresponding to the values
    3. col_ptr (n + 1 integers): the list of value indices where each column starts

  • Sparse Matrix Operations

    Widely used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), ...

    CSR: three arrays for storing an m × n matrix with nnz nonzeros
    1. val (nnz real numbers): the values of the nonzero elements
    2. col_ind (nnz integers): the column indices corresponding to the values
    3. row_ptr (m + 1 integers): the list of value indices where each row starts

  • Sparse Matrix Operations

    If A ∈ R^{m×n} (sparse), B ∈ R^{m×n} (sparse or dense), s ∈ R:

    A + B, sA, A^T: O(nnz) operations

    If A ∈ R^{m×n} (sparse), b ∈ R^{n×1}:

    Ab: O(nnz) operations

    If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (dense):

    AB: O(nnz(A) · n) operations (use sparse BLAS)

    If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (sparse):

    AB: O(nnz(A) nnz(B)/k) on average, O(nnz(A) · n) in the worst case

    The resulting matrix will be much denser
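    A minimal sketch (assuming scipy.sparse) of sparse operations whose costs scale with nnz rather than with m·n; the sizes and density are illustrative.

```python
import numpy as np
from scipy import sparse

m, n = 10_000, 5_000
A = sparse.random(m, n, density=1e-3, format='csr')   # nnz ≈ 1e-3 * m * n
b = np.random.rand(n)

y = A @ b              # sparse mat-vec: O(nnz(A)) operations
B = np.random.rand(n, 20)
Y = A @ B              # sparse * dense: roughly O(nnz(A) * 20) operations
C = A @ A.T            # sparse * sparse: result is typically much denser than A
print(A.nnz, C.nnz)
```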

  • Closed Form Solution

  • Solving Linear Regression

    Minimize the sum of squared errors J(w):

    J(w) = (1/2)‖Xw − y‖²
         = (1/2)(Xw − y)^T (Xw − y)
         = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y

    Derivative: ∂J(w)/∂w = X^T X w − X^T y

    Setting the derivative equal to zero gives the normal equation

    X^T X w* = X^T y

    Therefore, w* = (X^T X)^{-1} X^T y

    (but X^T X may not be invertible...)

  • Solving Linear Regression

    Normal equation: X^T X w* = X^T y

    If X^T X is invertible (typically when # samples > # features):

    w* = (X^T X)^{-1} X^T y

    If X^T X is low-rank (typically when # features > # samples): infinitely many solutions (why?)

    Least-norm solution: w* = X† y, where X† is the pseudo-inverse
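    A minimal sketch (not from the slides): when d > n, X^T X is singular and many w fit the training data exactly; np.linalg.pinv (and np.linalg.lstsq) return the least-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_pinv = np.linalg.pinv(X) @ y                    # least-norm solution X† y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # same solution

print(np.allclose(X @ w_pinv, y))   # fits the training data exactly
print(np.allclose(w_pinv, w_lstsq))
```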

  • Regularized Linear Regression

  • Overfitting

    Overfitting: the model has low training error but high prediction error.

    Using too many features can lead to overfitting

  • Avoid Overfitting by Controlling Model Complexity (*)

    In the following, we derive a bound for generalization (prediction) error

    Training & testing data are generated from the same distribution

    (x_i, y_i) ∼ D

    Generalization error (expectation of prediction error):

    R(f) := E_{(x,y)∼D}[(f(x) − y)²]

    Best model (minimizing prediction error): f* := argmin_{f ∈ F} R(f)

    Empirical error: error on the training data

    R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)²

    Our estimator: f̂ := argmin_{f ∈ F} R̂(f)

    F: a class of functions (e.g., all linear functions {f : f(x) = w^T x})

  • Avoid Overfitting by Controlling Model Complexity (*)

    Generalization error of f̂ :

    R(f̂) = R(f*) + (R̂(f̂) − R̂(f*)) + (R(f̂) − R̂(f̂)) + (R̂(f*) − R(f*))
         ≤ R(f*) + 0 + 2 max_{f ∈ F} |R̂(f) − R(f)|

    Overfitting is due to a large max_{f ∈ F} |R̂(f) − R(f)|

    How to make this term smaller?
    1. Have more samples (larger n)
    2. Make F a smaller set (regularization)

  • Avoid overfitting by Controlling Model Complexity (*)

    R(f̂)  (generalization error of the estimator)  ≤  R(f*) + 2 max_{f ∈ F} |R̂(f) − R(f)|

    Control F to be a subset of linear functions:
    1. Ridge regression: F = {f(x) = w^T x | ‖w‖_2 ≤ K}, where K is a constant
    2. Lasso: F = {f(x) = w^T x | ‖w‖_1 ≤ K}, where K is a constant

    If K is large ⇒ overfitting (max_{f ∈ F} |R̂(f) − R(f)| is large)

    If K is small ⇒ underfitting (R(f*) is large because F does not cover the best solution)

    Adding a constraint is equivalent to adding a regularization term:

    argmin_w ∑_{i=1}^n (w^T x_i − y_i)²  s.t. ‖w‖_2 ≤ K

    is equivalent to the following problem for some λ:

    argmin_w ∑_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

  • Regularized Linear Regression

    Regularized Linear Regression:

    argmin_w ‖Xw − y‖² + R(w)

    R(w): regularization term

    Ridge regression (ℓ2 regularization):

    argmin_w ‖Xw − y‖² + λ‖w‖²

    Lasso (ℓ1 regularization):

    argmin_w ‖Xw − y‖² + λ‖w‖_1

    Note that ‖w‖_1 = ∑_{i=1}^d |w_i|
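    A minimal sketch (an assumption, not from the slides) using scikit-learn's Ridge and Lasso estimators, whose objectives match the ones above up to constant scaling factors; the data and regularization strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:5] = rng.standard_normal(5)   # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)     # l2 regularization
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)     # l1 regularization

print(np.sum(np.abs(ridge.coef_) > 1e-8))   # ridge: typically all d coefficients nonzero
print(np.sum(np.abs(lasso.coef_) > 1e-8))   # lasso: typically a sparse coefficient vector
```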

  • Regularization

    Lasso: the solution is sparse, but there is no closed-form solution

  • Ridge Regression

    Ridge regression: argmin_{w ∈ R^d} J(w), where

    J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖²

    Closed-form solution: the optimal solution w* satisfies ∇J(w*) = 0:

    X^T X w* − X^T y + λ w* = 0
    (X^T X + λI) w* = X^T y

    Optimal solution: w* = (X^T X + λI)^{-1} X^T y

    The inverse always exists because X^T X + λI is positive definite

    What’s the computational complexity?

  • Time Complexity (Closed Form Solution)

  • Time Complexity (closed form solution of ridge regression)

    Goal: compute (X^T X + λI)^{-1}(X^T y)

    Step 1: compute A = X^T X + λI and b = X^T y
    What's the time complexity? O(nd² + nd) if X is dense

    Step 2: several options with O(d³) time complexity
    1. Compute A^{-1}, and then compute w* = A^{-1} b
    2. Directly solve the linear system Aw = b (calling the underlying LAPACK routine; e.g., Cholesky factorization)

    Which one is better?
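    A minimal sketch (not from the slides) of the two options for Step 2 in NumPy/SciPy; solving the linear system directly is generally preferred over explicitly forming A^{-1}, and all three calls below are O(d³).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 1000, 200, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = X.T @ X + lam * np.eye(d)      # Step 1: O(n d^2)
b = X.T @ y                        # Step 1: O(n d)

w1 = np.linalg.inv(A) @ b          # option 1: form A^{-1}, then multiply
w2 = np.linalg.solve(A, b)         # option 2: solve A w = b (LAPACK)
w3 = cho_solve(cho_factor(A), b)   # option 2: Cholesky, since A is positive definite

print(np.allclose(w1, w2), np.allclose(w2, w3))
```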

  • Time Complexity (closed form solution)

    Time complexity for Step 2: O(d³)

  • More on Linear System Solvers

    Can be done by a linear system solver: computing w such that Aw = y

    Different libraries can be used:

    Default LAPACK
    Intel Math Kernel Library (Intel MKL)
    AMD Core Math Library (ACML)

    Parallel numerical linear algebra packages are available today:

    ScaLAPACK (SCAlable Linear Algebra PACKage): linear algebra routines for distributed-memory machines
    PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra routines
    PLASMA (Parallel Linear Algebra Software for Multicore Architectures): combines state-of-the-art parallel algorithms and scheduling for optimized solutions of linear systems, eigenvalue problems, ...

    However, it still takes O(d³) time to solve Aw = y when A ∈ R^{d×d}

  • Time Complexity

    When X is dense:

    Closed-form solution requires O(nd² + d³) operations if X is dense

    Efficient if d is very small

    Runs forever when d > 100,000

    Typical case for big data applications:

    X ∈ R^{n×d} is sparse with large n and large d

    How can we solve the problem?

  • Iterative Solver (Coordinate Descent)

  • Optimization: iterative solver

    Iterative solver: generate a sequence of (improving) approximate solutions for a problem.

    w^(0) (initial point) → w^(1) → w^(2) → w^(3) → ...

  • Convergence of the iterative algorithm

    Convergence properties:

    Will the algorithm converge to a point?

    Will it converge to optimal solution(s)?

    What’s the convergence rate (how fast will the sequence converge)?

    We will only state the existing convergence properties in this class; we will focus on:

    computational complexity per iteration

  • Coordinate Descent

    Solve the optimization problem:

    argmin_w J(w)

    Update one variable at a time

    Obtain a model with reasonable performance after a few iterations

    Randomized coordinate descent: pick a random coordinate to update at each iteration

  • Coordinate Descent

    Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
    Output: solution w
    1: t = 0
    2: while not converged do
    3:   Randomly choose a variable w_j
    4:   Compute the optimal update on w_j by solving
           δ* = argmin_δ J(w + δ e_j) − J(w)
    5:   Update w_j: w_j ← w_j + δ*
    6:   t ← t + 1
    7: end while

  • Coordinate Descent

    Q: What is the exact CD rule for Ridge Regression?

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )
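    A minimal sketch (not from the slides) that checks this rule numerically: for the ridge objective J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖², the δ* above minimizes J along coordinate j.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def J(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.sum(w ** 2)

j = 3                                           # coordinate to update
delta_star = -(X[:, j] @ (X @ w - y) + lam * w[j]) / (lam + np.sum(X[:, j] ** 2))

# J restricted to coordinate j is a 1-D quadratic; delta_star should beat nearby deltas
e_j = np.zeros(d); e_j[j] = 1.0
deltas = delta_star + np.linspace(-1, 1, 5)
print(min(J(w + t * e_j) for t in deltas) >= J(w + delta_star * e_j) - 1e-12)
```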

  • Coordinate Descent (convergence)

    If J(w) is strongly convex, randomized coordinate descent converges to the global optimum with a global linear convergence rate.

    For ridge regression, J(w) is strongly convex because

    ∇²J(w) = X^T X + λI ⪰ λI ≻ 0

    We will show more details in the next class

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    What’s the computational complexity?

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    A naive approach, for each coordinate update (w_j):
    1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
    2. Compute h_j := ∑_{i=1}^n X_ij²: O(n_j) operations
    3. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    4. w_j ← w_j + δ*

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    Precompute h_j := ∑_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations

    For each coordinate update (w_j):
    1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
    2. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    3. w_j ← w_j + δ*

  • Time Complexity

    For updating wj , we need to compute

    δ* = − ( ∑_{i=1}^n X_ij (x_i^T w − y_i) + λ w_j ) / ( λ + ∑_{i=1}^n X_ij² )

    Assume X ∈ R^{n×d} is a sparse matrix; each column j of X has n_j nonzero entries (∑_{j=1}^d n_j = nnz(X))

    Precompute h_j := ∑_{i=1}^n X_ij² for all j = 1, ..., d: O(nnz(X)) operations

    Precompute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations

    For each coordinate update (w_j):
    1. Compute δ* = −(∑_{i=1}^n r_i X_ij + λ w_j)/(λ + h_j): O(n_j) operations
    2. For all i with X_ij ≠ 0, update r_i ← r_i + δ* X_ij: O(n_j) operations
    3. w_j ← w_j + δ*
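    A minimal sketch (not from the slides) of this residual-maintaining coordinate descent for ridge, J(w) = (1/2)‖Xw − y‖² + (λ/2)‖w‖², using a CSC matrix so that column j's n_j nonzeros can be scanned directly; sizes and hyperparameters are illustrative.

```python
import numpy as np
from scipy import sparse

def ridge_cd(X_csc, y, lam, n_epochs=50, seed=0):
    n, d = X_csc.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    h = np.asarray(X_csc.multiply(X_csc).sum(axis=0)).ravel()  # h_j = sum_i X_ij^2
    r = X_csc @ w - y                                          # residual r_i = x_i^T w - y_i
    for _ in range(n_epochs):
        for j in rng.permutation(d):                           # randomized coordinate order
            start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
            rows, vals = X_csc.indices[start:end], X_csc.data[start:end]
            delta = -(vals @ r[rows] + lam * w[j]) / (lam + h[j])  # O(n_j)
            r[rows] += delta * vals                            # maintain residual, O(n_j)
            w[j] += delta
    return w

# usage on random sparse data, compared with the closed-form solution
X = sparse.random(1000, 200, density=0.05, format='csc', random_state=0)
y = np.random.default_rng(0).standard_normal(1000)
w_cd = ridge_cd(X, y, lam=1.0)
A = (X.T @ X).toarray() + 1.0 * np.eye(200)
w_closed = np.linalg.solve(A, X.T @ y)
print(np.linalg.norm(w_cd - w_closed))   # should shrink as n_epochs grows
```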

  • Time Complexity

    On average, O(n̄) operations per coordinate update, where n̄ = nnz(X)/d

    T outer iterations, each performing d coordinate updates: T × O(nnz(X)) operations

    Closed-form solution: O(nnz(X) d + d³) operations

    Can be easily extended to solve the Lasso problem (see the sketch below)
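    A minimal sketch (an assumption, not from the slides) of that Lasso extension, taking the objective J(w) = (1/2)‖Xw − y‖² + λ‖w‖_1 and dense X for brevity; the only change from the ridge update is a soft-thresholding step in the coordinate-wise minimizer.

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.| : sign(z) * max(|z| - t, 0)
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_epochs=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    h = (X ** 2).sum(axis=0)            # h_j = sum_i X_ij^2
    r = X @ w - y                       # residual r_i = x_i^T w - y_i
    for _ in range(n_epochs):
        for j in rng.permutation(d):
            # minimize J over w_j with the other coordinates fixed
            w_new = soft_threshold(h[j] * w[j] - X[:, j] @ r, lam) / h[j]
            r += (w_new - w[j]) * X[:, j]   # maintain the residual
            w[j] = w_new
    return w

# usage: with a large enough lam, many coefficients come out exactly zero
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
w = lasso_cd(X, y, lam=10.0)
print(np.sum(w != 0))   # number of nonzero coefficients
```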

  • Other Iterative Solvers

    Other iterative solvers for ridge regression:

    Gradient Descent

    Stochastic Gradient Descent

    . . . . . .

  • Coming up

    Next class: intro to optimization

    Questions?