CS 231A Section 1: Linear Algebra & Probability Review
Jonathan Krause
9/28/2012
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes' Theorem, Chain Rule
Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples:
    w·x_i + b ≥ 0: x_i positive
    w·x_i + b < 0: x_i negative
• Which hyperplane (w, b) is best?
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
  (the examples lying on the margin are the support vectors)
Support Vector Machines (SVM)
• Wish to perform binary classification, i.e. find a linear classifier
• Given data x_1, …, x_n and labels y_1, …, y_n, where y_i ∈ {−1, +1}
• When the data is linearly separable, we can find our linear classifier by solving the optimization problem
    min_{w,b} (1/2)||w||²  subject to  y_i(w·x_i + b) ≥ 1 for all i
Nonlinear SVMs
• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space, e.g. x → (x, x²)
  (Figures: 1-D examples on the x axis and their mapping to the (x, x²) plane; not preserved)
Slide credit: Andrew Moore
Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a "lifting transformation" Φ: x → φ(x)
Slide credit: Andrew Moore
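The lifting idea can be seen numerically. This is a small NumPy illustration (the points and the threshold 1.5 are made up for the example): a 1-D dataset with positives on both sides of the negatives is not separable by any single threshold on x, but after mapping through φ(x) = (x, x²) a threshold on the x² coordinate separates it.

```python
import numpy as np

# Illustrative 1-D data: positives sit on both sides of the negatives.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# No single threshold on x separates the classes:
separable_1d = any(np.all((x > t) == (y > 0)) or np.all((x < t) == (y > 0))
                   for t in np.linspace(-4, 4, 81))
print(separable_1d)  # False

# After lifting, a threshold on the x^2 coordinate separates them perfectly:
phi = np.column_stack([x, x**2])
print(np.array_equal(phi[:, 1] > 1.5, y > 0))  # True
```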
SVM – l1 regularization
• What if the data is not linearly separable?
• We can use regularization (slack variables) to handle this
• We solve a new optimization problem and "tune" our regularization parameter C:
    min_{w,b,ξ} (1/2)||w||² + C ∑_i ξ_i  subject to  y_i(w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
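As a complement to the packaged solvers recommended below, here is a minimal NumPy sketch of training a soft-margin linear SVM by subgradient descent on the hinge loss. This is not how liblinear works internally; the toy data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Two well-separated Gaussian blobs, labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

# Minimize (1/2)||w||^2 + C * sum(hinge) by subgradient descent.
w, b, lr, C = np.zeros(2), 0.0, 0.1, 1.0
for _ in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1                                # margin violators
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
print((pred == y).mean())  # well-separated data -> accuracy 1.0
```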
Solving the SVM
• There are many different packages for solving SVMs
• In PS0 we have you use the liblinear package. This is an efficient implementation, but it can only use a linear kernel: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• If you wish to have more flexibility in your choice of kernel, you can use the LibSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Boosting
• It is a sequential procedure. Each data point x_t has a class label y_t = ±1 and a weight, initialized to w_t = 1.
Y. Freund and R. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Toy example
• Weak learners come from the family of lines: a line h with p(error) = 0.5 is at chance.
• Each data point has a class label y_t = ±1 and a weight w_t = 1.
Toy example
• This one seems to be the best.
• This is a "weak classifier": it performs slightly better than chance.
Toy example
• We update the weights: w_t ← w_t exp{−y_t H_t}
(Toy example, subsequent rounds: the same weight update w_t ← w_t exp{−y_t H_t} is applied after each round; the figures showing the reweighted points are not preserved.)
Toy example
• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4.
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_t(x) is a weak classifier over the features vector x, and α_t is its weight.
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
• We need to define a family of weak classifiers h_k(x)
Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; doesn't require external optimization tools
Boosting – mathematics
• Weak learners threshold the value of a rectangle feature f_j(x):
    h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise
  where θ_j is the threshold.
• Final strong classifier:
    h(x) = 1 if ∑_{t=1}^{T} α_t h_t(x) ≥ (1/2) ∑_{t=1}^{T} α_t, 0 otherwise
Weak classifier
• 4 kinds of rectangle filters
• Value = ∑ (pixels in white area) − ∑ (pixels in black area)
Slide credit: S. Lazebnik
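The rectangle-filter value above can be computed with four array lookups using an integral image, as in Viola & Jones. This is a hedged NumPy sketch (the 4x4 image and the left/right rectangle layout are made up for the example); ii[r, c] holds the sum of all pixels above and to the left of (r, c), inclusive.

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)
ii = img.cumsum(axis=0).cumsum(axis=1)  # integral image

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] via 4 lookups in the integral image."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

# Two-rectangle feature: white = left half, black = right half.
white = rect_sum(ii, 0, 0, 3, 1)
black = rect_sum(ii, 0, 2, 3, 3)
print(white - black)  # equals img[:, :2].sum() - img[:, 2:].sum() = -16.0
```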
Viola & Jones algorithm
1. Evaluate each rectangle filter on each example (x_1, y_1), …, (x_n, y_n), with labels y_i ∈ {0, 1}, and threshold it:
    h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise
   (e.g. filter responses 0.8, 0.7, 0.2, 0.3, 0.8, 0.1 on the examples (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), …)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
• For a 24x24 detection region, the set of possible rectangle features is very large
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
2. Select the best filter/threshold combination:
   a. Normalize the weights:
        w_{t,i} ← w_{t,i} / ∑_{j=1}^{n} w_{t,j}
   b. For each feature j, compute the weighted error:
        ε_j = ∑_i w_i |h_j(x_i) − y_i|
   c. Choose the classifier h_t with the lowest error ε_t
3. Reweight examples:
        w_{t+1,i} = w_{t,i} β_t^{1−e_i},  where β_t = ε_t / (1 − ε_t),
        e_i = 0 if example x_i is classified correctly, and e_i = 1 otherwise
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
4. The final strong classifier is
    h(x) = 1 if ∑_{t=1}^{T} α_t h_t(x) ≥ (1/2) ∑_{t=1}^{T} α_t, 0 otherwise
   where α_t = log(1/β_t)
The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Boosting for face detection
• For each round of boosting:
  1. Evaluate each rectangle filter on each example
  2. Select the best filter/threshold combination
  3. Reweight examples
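The three-step round above can be sketched compactly. This is a hedged AdaBoost-style illustration using 1-D threshold stumps as stand-ins for rectangle features (the data is made up); it uses the standard weighting α_t = ½ log((1−ε_t)/ε_t), which differs from the α_t = log(1/β_t) on the previous slides only by a constant factor when applied uniformly.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1, 1, -1, 1, -1, -1])          # not separable by one threshold
n = len(x)
w = np.ones(n) / n
alphas, stumps = [], []

for _ in range(5):                           # rounds of boosting
    w = w / w.sum()                          # a. normalize the weights
    best = None
    for t in np.unique(x) - 0.05:            # b. weighted error of each stump
        for sign in (1, -1):
            pred = np.where(x < t, sign, -sign)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best                      # c. keep the lowest-error stump
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    pred = np.where(x < t, sign, -sign)
    w = w * np.exp(-alpha * y * pred)        # 3. reweight examples
    alphas.append(alpha); stumps.append((t, sign))

# Final strong classifier: sign of the weighted vote of the weak stumps.
H = np.sign(sum(a * np.where(x < t, s, -s) for a, (t, s) in zip(alphas, stumps)))
print((H == y).mean())
```

No single stump classifies this data perfectly, but the weighted vote does; that is the point of the additive model.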
The implemented system
• Training data
  – 5000 faces: all frontal, rescaled to 24x24 pixels
  – 300 million non-faces, drawn from 9500 non-face images
  – Faces are normalized for scale and translation
• Many variations: across individuals, illumination, pose
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
System performance
• Training time: "weeks" on a 466 MHz Sun workstation
• 38 layers, 6061 features in total
• Average of 10 features evaluated per window on the test set
• "On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds"
  – 15 Hz
  – 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Output of Face Detector on Test Images
(Example detections; images not preserved.)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Linear Algebra in Computer Vision
• Representation
  – 3D points in the scene
  – 2D points in the image (images are matrices)
• Transformations
  – Mapping 2D to 2D
  – Mapping 3D to 2D
Notation
• We adopt the notation A ∈ R^{m×n} for a matrix, which is a real-valued matrix with m rows and n columns
• We adopt the notation x ∈ R^n for a column vector, and x^T for a row vector
Notation
• To indicate the element in the ith row and jth column of a matrix we use A_{ij}
• Similarly, to indicate the ith entry of a vector we use x_i
Norms
• Intuitively, the norm of a vector is a measure of its "length"
• The l2 norm is defined as
    ||x||_2 = sqrt(∑_{i=1}^{n} x_i²)
  In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript on the norm for convenience.
• Note that ||x||² = x^T x
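A quick NumPy check of the definition and the identity ||x||² = x^T x:

```python
import numpy as np

x = np.array([3.0, 4.0])
norm = np.sqrt((x ** 2).sum())       # l2 norm from the definition
print(norm)                          # 5.0
print(np.isclose(norm ** 2, x @ x))  # ||x||^2 == x^T x -> True
```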
Linear Independence and Rank
• A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set
• The rank of a matrix is the maximal number of linearly independent columns or rows of the matrix
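For example (an illustrative matrix, not from the slides), a 3x3 matrix whose third column is the sum of the first two has only two linearly independent columns:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0]])   # col 3 = col 1 + col 2
print(np.linalg.matrix_rank(A))  # 2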
Range and Nullspace
• The range of a matrix A ∈ R^{m×n} is the span of the columns of the matrix, denoted by the set
    R(A) = {v ∈ R^m : v = Ax for some x ∈ R^n}
• The nullspace of a matrix is the set of vectors that, when multiplied by the matrix, result in 0, given by the set
    N(A) = {x ∈ R^n : Ax = 0}
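A small check of the nullspace definition on an illustrative rank-deficient matrix (the matrix and vector are made up for the example): x = (1, 1, −1) satisfies Ax = 0, so it lies in N(A), while any product Ay lies in R(A) by definition.

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0]])   # col 3 = col 1 + col 2, so N(A) is nontrivial
x = np.array([1.0, 1.0, -1.0])
print(A @ x)                     # [0. 0. 0.] -> x is in the nullspace N(A)

y = np.array([2.0, -1.0, 0.5])
print(A @ y)                     # any vector of the form Ay is in the range R(A)
```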
Eigenvalues and Eigenvectors
• Given a square matrix A ∈ R^{n×n}, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if
    Ax = λx
• We can solve for the eigenvalues by solving for the roots of the characteristic polynomial generated by
    det(λI − A) = 0
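A numerical check on an illustrative 2x2 matrix: its characteristic polynomial (λ − 2)² − 1 = 0 gives λ = 1, 3, which matches what np.linalg.eig returns, and each eigenpair satisfies Av = λv.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)
print(np.allclose(sorted(vals), [1, 3]))            # roots of (l-2)^2 - 1 -> True
v = vecs[:, 0]
print(np.allclose(A @ v, vals[0] * v))              # A v = lambda v -> True
```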
Eigenvalue Properties
• The rank of a matrix is equal to the number of its non-zero eigenvalues
• The eigenvalues of a diagonal matrix are simply its diagonal entries
• A matrix A is said to be diagonalizable if we can write A = PDP^{-1} with D diagonal
Eigenvalues & Eigenvectors of Symmetric Matrices
• The eigenvalues of symmetric matrices are real
• The eigenvectors of symmetric matrices are orthonormal
• Consider the optimization problem involving a symmetric matrix A:
    max_x x^T A x  subject to  ||x|| = 1
  The maximizing x is the eigenvector corresponding to the largest eigenvalue.
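This maximization property can be checked numerically on an illustrative symmetric matrix: random unit vectors never exceed the top eigenvalue, which the top eigenvector attains.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)            # eigh: symmetric eigendecomposition
top_val, top_vec = vals[-1], vecs[:, -1]  # eigh sorts eigenvalues ascending
print(np.isclose(top_vec @ A @ top_vec, top_val))  # True: maximum is attained
for _ in range(100):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)
    assert x @ A @ x <= top_val + 1e-9    # no unit vector beats the top eigenvalue
print("ok")
```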
Generalized Eigenvalues
• Generalized eigenvalue problem: Ax = λBx
• Generalized eigenvalues must satisfy det(A − λB) = 0
• This reduces to the original eigenvalue problem when B^{-1} exists
• Generalized eigenvalues are used in Fisherfaces
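The reduction to the ordinary problem can be demonstrated directly (the diagonal A and B here are illustrative): when B is invertible, the eigenpairs of B^{-1}A satisfy Ax = λBx.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])
vals, vecs = np.linalg.eig(np.linalg.inv(B) @ A)  # ordinary problem on B^-1 A
for lam, v in zip(vals, vecs.T):
    print(np.allclose(A @ v, lam * (B @ v)))      # A x = lambda B x -> True
```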
Singular Value Decomposition (SVD)
• The SVD of a matrix A ∈ R^{m×n} is given by A = UΣV^T, where
  – the columns of U are called the left singular vectors,
  – Σ is a diagonal matrix whose values σ_i are called the singular values, and
  – the columns of V are called the right singular vectors
SVD
• If the matrix A has rank r, then A has r non-zero singular values
• The first r columns of U are an orthonormal basis for R(A)
• The last n − r columns of V are an orthonormal basis for N(A)
• The singular values of A are the square roots of the non-zero eigenvalues of A^T A or AA^T
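The last property can be checked in NumPy on an illustrative 3x2 matrix: the singular values match the square roots of the eigenvalues of A^T A, and the factors reassemble A.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals = np.linalg.eigvalsh(A.T @ A)             # eigenvalues, ascending order
print(np.allclose(sorted(s), np.sqrt(evals)))   # sigma_i = sqrt(eig(A^T A)) -> True
print(np.allclose(U @ np.diag(s) @ Vt, A))      # A = U S V^T -> True
```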
Matlab
• [V,D] = eig(A)
  The eigenvectors of A are the columns of V. D is a diagonal matrix whose entries are the eigenvalues of A.
• [V,D] = eig(A,B)
  The generalized eigenvectors are the columns of V. D is a diagonal matrix whose entries are the generalized eigenvalues.
• [U,S,V] = svd(X)
  The columns of U are the left singular vectors of X. S is a diagonal matrix whose entries are the singular values of X. The columns of V are the right singular vectors of X. Recall X = U*S*V'.
Matrix Calculus – Gradient
• Let f : R^{m×n} → R; then the gradient ∇_A f(A) is the matrix of partial derivatives with entries
    (∇_A f(A))_{ij} = ∂f(A) / ∂A_{ij}
• ∇_A f(A) is always the same size as A; thus if we just have a vector x, the gradient is simply
    ∇_x f(x) = (∂f/∂x_1, …, ∂f/∂x_n)^T
Gradients
• Built from partial derivatives, as above
• Some common gradients:
    ∇_x (b^T x) = b
    ∇_x (x^T A x) = (A + A^T) x   (= 2Ax when A is symmetric)
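These common gradients can be verified against central finite differences (the matrix, vector, and step size below are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([1.0, -2.0])
x = np.array([0.5, 1.5])
eps = 1e-6

def num_grad(f, x):
    """Central-difference estimate of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda z: b @ z, x), b))                 # grad(b^T x) = b -> True
print(np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x)) # grad(x^T A x) -> True
```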
Probability in Computer Vision
• The foundation for algorithms to solve
  – Tracking problems
  – Human activity recognition
  – Object recognition
  – Segmentation
Probability Axioms
• Sample space: the set of all outcomes of a random experiment, denoted Ω
• Event space: a set F whose elements A ∈ F (events) are subsets of Ω
• Probability measure: a function P : F → R that satisfies
  – P(A) ≥ 0 for all A ∈ F
  – P(Ω) = 1
  – If A_1, A_2, … are disjoint events, then P(∪_i A_i) = ∑_i P(A_i)
Conditional Probability
• P(A|B) = P(A ∩ B) / P(B)
• Two events are independent if P(A ∩ B) = P(A)P(B)
• Conditional independence: P(A ∩ B | C) = P(A|C)P(B|C)
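The definition of conditional probability can be checked empirically (a made-up two-dice example): with A = "first die shows 6" and B = "sum is at least 10", the exact value of P(A|B) is 3/6 = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=200_000)   # first die
d2 = rng.integers(1, 7, size=200_000)   # second die
A = d1 == 6
B = d1 + d2 >= 10
p_a_given_b = (A & B).mean() / B.mean() # empirical P(A and B) / P(B)
print(p_a_given_b)                      # close to the exact value 3/6 = 0.5
```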
Product Rule
• From the definition of conditional probability we can write
    P(A ∩ B) = P(A|B)P(B)
• From the product rule we can derive the chain rule of probability:
    P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1, A_2) ⋯ P(A_n|A_1, …, A_{n−1})