CS 231A Section 1: Linear Algebra & Probability Review
Jonathan Krause
9/28/2012
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes' Theorem, Chain Rule
Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples:
    w·x_i + b ≥ 0: x_i positive
    w·x_i + b < 0: x_i negative
• Which hyperplane (w, b) is best?
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
  (the examples lying on the margin are the support vectors)
Support Vector Machines (SVM)
• Wish to perform binary classification, i.e. find a linear classifier
• Given data x_1, …, x_n and labels y_1, …, y_n, where y_i ∈ {−1, +1}
• When the data is linearly separable, we can find our linear classifier by solving the optimization problem
    min_{w,b} (1/2)||w||²  subject to  y_i(w·x_i + b) ≥ 1 for all i
Nonlinear SVMs
• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space, e.g. x → (x, x²)
  (Figures: 1-D examples on the x axis and their mapping to the (x, x²) plane; not preserved)
Slide credit: Andrew Moore
Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a "lifting transformation" Φ: x → φ(x)
Slide credit: Andrew Moore
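The lifting idea can be seen numerically. This is a small NumPy illustration (the points and the threshold 1.5 are made up for the example): a 1-D dataset with positives on both sides of the negatives is not separable by any single threshold on x, but after mapping through φ(x) = (x, x²) a threshold on the x² coordinate separates it.

```python
import numpy as np

# Illustrative 1-D data: positives sit on both sides of the negatives.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# No single threshold on x separates the classes:
separable_1d = any(np.all((x > t) == (y > 0)) or np.all((x < t) == (y > 0))
                   for t in np.linspace(-4, 4, 81))
print(separable_1d)  # False

# After lifting, a threshold on the x^2 coordinate separates them perfectly:
phi = np.column_stack([x, x**2])
print(np.array_equal(phi[:, 1] > 1.5, y > 0))  # True
```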
SVM – l1 regularization
• What if the data is not linearly separable?
• We can use regularization (slack variables) to handle this
• We solve a new optimization problem and "tune" our regularization parameter C:
    min_{w,b,ξ} (1/2)||w||² + C ∑_i ξ_i  subject to  y_i(w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
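As a complement to the packaged solvers recommended below, here is a minimal NumPy sketch of training a soft-margin linear SVM by subgradient descent on the hinge loss. This is not how liblinear works internally; the toy data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Two well-separated Gaussian blobs, labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

# Minimize (1/2)||w||^2 + C * sum(hinge) by subgradient descent.
w, b, lr, C = np.zeros(2), 0.0, 0.1, 1.0
for _ in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1                                # margin violators
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
print((pred == y).mean())  # well-separated data -> accuracy 1.0
```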
Solving the SVM
• There are many different packages for solving SVMs
• In PS0 we have you use the liblinear package. This is an efficient implementation, but it can only use a linear kernel: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• If you wish to have more flexibility in your choice of kernel, you can use the LibSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Boosting
• It is a sequential procedure. Each data point x_t has a class label y_t = ±1 and a weight, initialized to w_t = 1.
Y. Freund and R. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Toy example
• Weak learners come from the family of lines: a line h with p(error) = 0.5 is at chance.
• Each data point has a class label y_t = ±1 and a weight w_t = 1.
Toy example
• This one seems to be the best.
• This is a "weak classifier": it performs slightly better than chance.
Toy example
• We update the weights: w_t ← w_t exp{−y_t H_t}
(Toy example, subsequent rounds: the same weight update w_t ← w_t exp{−y_t H_t} is applied after each round; the figures showing the reweighted points are not preserved.)
Toy example
• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_t(x) is a weak classifier over the features vector x, and α_t is its weight.
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
• We need to define a family of weak classifiers h_k(x)
Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; doesn't require external optimization tools
Boosting – mathematics
• Weak learners threshold the value of a rectangle feature f_j(x):
    h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise
  where θ_j is the threshold.
• Final strong classifier:
    h(x) = 1 if ∑_{t=1}^{T} α_t h_t(x) ≥ (1/2) ∑_{t=1}^{T} α_t, 0 otherwise
Weak classifier
• 4 kinds of rectangle filters
• Value = ∑ (pixels in white area) − ∑ (pixels in black area)
Slide credit: S. Lazebnik
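The rectangle-filter value above can be computed with four array lookups using an integral image, as in Viola & Jones. This is a hedged NumPy sketch (the 4x4 image and the left/right rectangle layout are made up for the example); ii[r, c] holds the sum of all pixels above and to the left of (r, c), inclusive.

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)
ii = img.cumsum(axis=0).cumsum(axis=1)  # integral image

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] via 4 lookups in the integral image."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

# Two-rectangle feature: white = left half, black = right half.
white = rect_sum(ii, 0, 0, 3, 1)
black = rect_sum(ii, 0, 2, 3, 3)
print(white - black)  # equals img[:, :2].sum() - img[:, 2:].sum() = -16.0
```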
Viola & Jones algorithm
1. Evaluate each rectangle filter on each example (x_1, y_1), …, (x_n, y_n), with labels y_i ∈ {0, 1}, and threshold it:
    h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise
   (e.g. filter responses 0.8, 0.7, 0.2, 0.3, 0.8, 0.1 on the examples (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), …)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
• For a 24x24 detection region, the set of possible rectangle features is very large
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
2. Select the best filter/threshold combination:
   a. Normalize the weights:
        w_{t,i} ← w_{t,i} / ∑_{j=1}^{n} w_{t,j}
   b. For each feature j, compute the weighted error:
        ε_j = ∑_i w_i |h_j(x_i) − y_i|
   c. Choose the classifier h_t with the lowest error ε_t
3. Reweight examples:
        w_{t+1,i} = w_{t,i} β_t^{1−e_i},  where β_t = ε_t / (1 − ε_t),
        e_i = 0 if example x_i is classified correctly, and e_i = 1 otherwise
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
4. The final strong classifier is
    h(x) = 1 if ∑_{t=1}^{T} α_t h_t(x) ≥ (1/2) ∑_{t=1}^{T} α_t, 0 otherwise
   where α_t = log(1/β_t)
The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Boosting for face detection
• For each round of boosting:
  1. Evaluate each rectangle filter on each example
  2. Select the best filter/threshold combination
  3. Reweight examples
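The three-step round above can be sketched compactly. This is a hedged AdaBoost-style illustration using 1-D threshold stumps as stand-ins for rectangle features (the data is made up); it uses the standard weighting α_t = ½ log((1−ε_t)/ε_t), which differs from the α_t = log(1/β_t) on the previous slides only by a constant factor when applied uniformly.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1, 1, -1, 1, -1, -1])          # not separable by one threshold
n = len(x)
w = np.ones(n) / n
alphas, stumps = [], []

for _ in range(5):                           # rounds of boosting
    w = w / w.sum()                          # a. normalize the weights
    best = None
    for t in np.unique(x) - 0.05:            # b. weighted error of each stump
        for sign in (1, -1):
            pred = np.where(x < t, sign, -sign)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best                      # c. keep the lowest-error stump
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    pred = np.where(x < t, sign, -sign)
    w = w * np.exp(-alpha * y * pred)        # 3. reweight examples
    alphas.append(alpha); stumps.append((t, sign))

# Final strong classifier: sign of the weighted vote of the weak stumps.
H = np.sign(sum(a * np.where(x < t, s, -s) for a, (t, s) in zip(alphas, stumps)))
print((H == y).mean())
```

No single stump classifies this data perfectly, but the weighted vote does; that is the point of the additive model.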
The implemented system
• Training data
  – 5000 faces: all frontal, rescaled to 24x24 pixels
  – 300 million non-faces, drawn from 9500 non-face images
  – Faces are normalized for scale and translation
• Many variations: across individuals, illumination, pose
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
System performance
• Training time: "weeks" on a 466 MHz Sun workstation
• 38 layers, 6061 features in total
• Average of 10 features evaluated per window on the test set
• "On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds"
  – 15 Hz
  – 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Output of Face Detector on Test Images
(Example detections; images not preserved.)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Linear Algebra in Computer Vision
• Representation
  – 3D points in the scene
  – 2D points in the image (images are matrices)
• Transformations
  – Mapping 2D to 2D
  – Mapping 3D to 2D
Notation
• We adopt the notation A ∈ R^{m×n} for a matrix, which is a real-valued matrix with m rows and n columns
• We adopt the notation x ∈ R^n for a column vector, and x^T for a row vector
Notation
• To indicate the element in the ith row and jth column of a matrix we use A_{ij}
• Similarly, to indicate the ith entry of a vector we use x_i
Norms
• Intuitively, the norm of a vector is a measure of its "length"
• The l2 norm is defined as
    ||x||_2 = sqrt(∑_{i=1}^{n} x_i²)
  In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript on the norm for convenience.
• Note that ||x||² = x^T x
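A quick NumPy check of the definition and the identity ||x||² = x^T x:

```python
import numpy as np

x = np.array([3.0, 4.0])
norm = np.sqrt((x ** 2).sum())       # l2 norm from the definition
print(norm)                          # 5.0
print(np.isclose(norm ** 2, x @ x))  # ||x||^2 == x^T x -> True
```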
Linear Independence and Rank
• A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set
• The rank of a matrix is the maximal number of linearly independent columns or rows of the matrix
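For example (an illustrative matrix, not from the slides), a 3x3 matrix whose third column is the sum of the first two has only two linearly independent columns:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0]])   # col 3 = col 1 + col 2
print(np.linalg.matrix_rank(A))  # 2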
Range and Nullspace
• The range of a matrix A ∈ R^{m×n} is the span of the columns of the matrix, denoted by the set
    R(A) = {v ∈ R^m : v = Ax for some x ∈ R^n}
• The nullspace of a matrix is the set of vectors that, when multiplied by the matrix, result in 0, given by the set
    N(A) = {x ∈ R^n : Ax = 0}
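A small check of the nullspace definition on an illustrative rank-deficient matrix (the matrix and vector are made up for the example): x = (1, 1, −1) satisfies Ax = 0, so it lies in N(A), while any product Ay lies in R(A) by definition.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0]])   # col 3 = col 1 + col 2, so N(A) is nontrivial
x = np.array([1.0, 1.0, -1.0])
print(A @ x)                     # [0. 0. 0.] -> x is in the nullspace N(A)

y = np.array([2.0, -1.0, 0.5])
print(A @ y)                     # any vector of the form Ay is in the range R(A)
```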
Eigenvalues and Eigenvectors
• Given a square matrix A ∈ R^{n×n}, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if
    Ax = λx
• We can solve for the eigenvalues by solving for the roots of the characteristic polynomial generated by
    det(λI − A) = 0
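A numerical check on an illustrative 2x2 matrix: its characteristic polynomial (λ − 2)² − 1 = 0 gives λ = 1, 3, which matches what np.linalg.eig returns, and each eigenpair satisfies Av = λv.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)
print(np.allclose(sorted(vals), [1, 3]))            # roots of (l-2)^2 - 1 -> True
v = vecs[:, 0]
print(np.allclose(A @ v, vals[0] * v))              # A v = lambda v -> True
```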
Eigenvalue Properties
• The rank of a matrix is equal to the number of its non-zero eigenvalues
• The eigenvalues of a diagonal matrix are simply its diagonal entries
• A matrix A is said to be diagonalizable if we can write A = PDP^{-1} with D diagonal
Eigenvalues & Eigenvectors of Symmetric Matrices
• The eigenvalues of symmetric matrices are real
• The eigenvectors of symmetric matrices are orthonormal
• Consider the optimization problem involving a symmetric matrix A:
    max_x x^T A x  subject to  ||x|| = 1
  The maximizing x is the eigenvector corresponding to the largest eigenvalue.
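This maximization property can be checked numerically on an illustrative symmetric matrix: random unit vectors never exceed the top eigenvalue, which the top eigenvector attains.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)            # eigh: symmetric eigendecomposition
top_val, top_vec = vals[-1], vecs[:, -1]  # eigh sorts eigenvalues ascending
print(np.isclose(top_vec @ A @ top_vec, top_val))  # True: maximum is attained
for _ in range(100):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)
    assert x @ A @ x <= top_val + 1e-9    # no unit vector beats the top eigenvalue
print("ok")
```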
Generalized Eigenvalues
• Generalized eigenvalue problem: Ax = λBx
• Generalized eigenvalues must satisfy det(A − λB) = 0
• This reduces to the original eigenvalue problem when B^{-1} exists
• Generalized eigenvalues are used in Fisherfaces
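The reduction to the ordinary problem can be demonstrated directly (the diagonal A and B here are illustrative): when B is invertible, the eigenpairs of B^{-1}A satisfy Ax = λBx.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])
vals, vecs = np.linalg.eig(np.linalg.inv(B) @ A)  # ordinary problem on B^-1 A
for lam, v in zip(vals, vecs.T):
    print(np.allclose(A @ v, lam * (B @ v)))      # A x = lambda B x -> True
```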
Singular Value Decomposition (SVD)
• The SVD of a matrix A ∈ R^{m×n} is given by A = UΣV^T, where
  – the columns of U are called the left singular vectors,
  – Σ is a diagonal matrix whose values σ_i are called the singular values, and
  – the columns of V are called the right singular vectors
SVD
• If the matrix A has rank r, then A has r non-zero singular values
• The first r columns of U are an orthonormal basis for R(A)
• The last n − r columns of V are an orthonormal basis for N(A)
• The singular values of A are the square roots of the non-zero eigenvalues of A^T A or AA^T
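The last property can be checked in NumPy on an illustrative 3x2 matrix: the singular values match the square roots of the eigenvalues of A^T A, and the factors reassemble A.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals = np.linalg.eigvalsh(A.T @ A)             # eigenvalues, ascending order
print(np.allclose(sorted(s), np.sqrt(evals)))   # sigma_i = sqrt(eig(A^T A)) -> True
print(np.allclose(U @ np.diag(s) @ Vt, A))      # A = U S V^T -> True
```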
Matlab
• [V,D] = eig(A)
  The eigenvectors of A are the columns of V. D is a diagonal matrix whose entries are the eigenvalues of A.
• [V,D] = eig(A,B)
  The generalized eigenvectors are the columns of V. D is a diagonal matrix whose entries are the generalized eigenvalues.
• [U,S,V] = svd(X)
  The columns of U are the left singular vectors of X. S is a diagonal matrix whose entries are the singular values of X. The columns of V are the right singular vectors of X. Recall X = U*S*V'.
Matrix Calculus – Gradient
• Let f : R^{m×n} → R; then the gradient ∇_A f(A) is the matrix of partial derivatives with entries
    (∇_A f(A))_{ij} = ∂f(A) / ∂A_{ij}
• ∇_A f(A) is always the same size as A; thus if we just have a vector x, the gradient is simply
    ∇_x f(x) = (∂f/∂x_1, …, ∂f/∂x_n)^T
Gradients
• Built from partial derivatives, as above
• Some common gradients:
    ∇_x (b^T x) = b
    ∇_x (x^T A x) = (A + A^T) x   (= 2Ax when A is symmetric)
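These common gradients can be verified against central finite differences (the matrix, vector, and step size below are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([1.0, -2.0])
x = np.array([0.5, 1.5])
eps = 1e-6

def num_grad(f, x):
    """Central-difference estimate of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda z: b @ z, x), b))                 # grad(b^T x) = b -> True
print(np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x)) # grad(x^T A x) -> True
```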
Probability in Computer Vision
• The foundation for algorithms to solve
  – Tracking problems
  – Human activity recognition
  – Object recognition
  – Segmentation
Probability Axioms
• Sample space: the set of all outcomes of a random experiment, denoted Ω
• Event space: a set F whose elements A ∈ F (events) are subsets of Ω
• Probability measure: a function P : F → R that satisfies
  – P(A) ≥ 0 for all A ∈ F
  – P(Ω) = 1
  – If A_1, A_2, … are disjoint events, then P(∪_i A_i) = ∑_i P(A_i)
Conditional Probability
• P(A|B) = P(A ∩ B) / P(B)
• Two events are independent if P(A ∩ B) = P(A)P(B)
• Conditional independence: P(A ∩ B | C) = P(A|C)P(B|C)
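The definition of conditional probability can be checked empirically (a made-up two-dice example): with A = "first die shows 6" and B = "sum is at least 10", the exact value of P(A|B) is 3/6 = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=200_000)   # first die
d2 = rng.integers(1, 7, size=200_000)   # second die
A = d1 == 6
B = d1 + d2 >= 10
p_a_given_b = (A & B).mean() / B.mean() # empirical P(A and B) / P(B)
print(p_a_given_b)                      # close to the exact value 3/6 = 0.5
```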
Product Rule
• From the definition of conditional probability we can write
    P(A ∩ B) = P(A|B)P(B)
• From the product rule we can derive the chain rule of probability:
    P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1, A_2) ⋯ P(A_n|A_1, …, A_{n−1})