Overview of Back Propagation Algorithm Shuiwang Ji


Page 1

Overview of Back Propagation Algorithm

Shuiwang Ji

Page 2

A Sample Network

Page 3

Forward Operation

• The general feed-forward operation is:
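In the standard notation for a single-hidden-layer network (a reconstruction in my own symbols, not the slide's), the feed-forward operation computed by output unit k is:

z_k = g_k(x) = f( Σ_j w_kj · f( Σ_i w_ji x_i + w_j0 ) + w_k0 )

where x_i are the inputs, w_ji the input-to-hidden weights, w_kj the hidden-to-output weights, and f a nonlinear activation function.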

Page 4

Back Propagation Algorithm

• The hidden-to-output weights can be learned by minimizing the error

• The power of back-propagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights

• We consider the error function:

• The update rule is:
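In standard form (my notation; the slides most likely use the squared-error criterion and gradient descent), the error function and update rule are:

J(w) = (1/2) Σ_k (t_k − z_k)^2        and        Δw = −η ∂J/∂w

with targets t_k, network outputs z_k, and learning rate η.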

Page 5

Hidden-to-output Weights

The chain rule:

The sensitivity of unit k is:

and

Overall, the derivative is:
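Reconstructed in the usual notation, with net_k the weighted input to output unit k and y_j the activation of hidden unit j:

∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj)
δ_k ≡ −∂J/∂net_k = (t_k − z_k) f′(net_k)        and        ∂net_k/∂w_kj = y_j

so ∂J/∂w_kj = −δ_k y_j, and the hidden-to-output update is Δw_kj = η δ_k y_j.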

Page 6

Input-to-hidden Weights

The chain rule:

The real back propagation:

Overall the rule is:
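In the same reconstructed notation, with x_i an input component and net_j the weighted input to hidden unit j:

∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)
δ_j ≡ f′(net_j) Σ_k w_kj δ_k        (the back-propagation of the output sensitivities)
Δw_ji = η δ_j x_i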

Page 7

Back Propagation of Sensitivity

1. The sensitivity at a hidden unit is proportional to the weighted sum of the sensitivities at the output units

2. The output unit sensitivities are thus propagated “back” to the hidden units
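As an illustration of these two points, below is a minimal sketch of one stochastic-gradient step for a single-hidden-layer network with sigmoid units and squared error; the names, shapes, learning rate, and the omission of bias terms are my own choices for brevity, not details taken from the slides.

import numpy as np

def f(a):
    # Sigmoid activation; its derivative is f(a) * (1 - f(a)).
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=0.1):
    # Forward pass.
    net_j = W1 @ x            # weighted inputs to hidden units
    y = f(net_j)              # hidden activations
    net_k = W2 @ y            # weighted inputs to output units
    z = f(net_k)              # network outputs

    # Output-unit sensitivities: delta_k = (t_k - z_k) * f'(net_k).
    delta_k = (t - z) * z * (1.0 - z)
    # Hidden-unit sensitivities: the weighted sum of the output sensitivities,
    # propagated "back" through the hidden-to-output weights.
    delta_j = y * (1.0 - y) * (W2.T @ delta_k)

    # Gradient-descent updates for both weight layers.
    W2 += eta * np.outer(delta_k, y)
    W1 += eta * np.outer(delta_j, x)
    return W1, W2

# Tiny usage example with random data.
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((3, 4))   # 4 inputs -> 3 hidden units
W2 = 0.1 * rng.standard_normal((2, 3))   # 3 hidden units -> 2 outputs
x, t = rng.standard_normal(4), np.array([0.0, 1.0])
W1, W2 = backprop_step(x, t, W1, W2)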

Page 8

Training Hierarchical Feed-forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks

ECCV'08, Kai Yu

Presented by Shuiwang Ji

Page 9

Transfer Learning

• Transfer learning, also known as multi-task learning, is a mechanism that improves generalization by leveraging shared domain-specific information contained in related tasks

• In the setting considered in this paper, all tasks share the same input space

Page 10

General Formulation

• The main task to be learnt has index m with training examples

• A neural network has a natural architecture to tackle this learning problem by minimizing:
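A common way to write the objective for the main task alone (my own notation, not the paper's) is

min_θ  Σ_n ℓ( f_m(x_n; θ), y_n^m ) + Ω(θ)

where f_m(·; θ) is the network output for task m, ℓ a per-example loss, and Ω(θ) a regularizer on the parameters.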

Page 11

General Formulation

• The network is learned by additionally introducing pseudo auxiliary tasks, each represented by a set of input-output pairs to be learnt:

• The regularization term then becomes a loss defined on these pseudo-tasks (one possible form is sketched after this list)

• A Bayesian perspective (skipped)
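One plausible form of this pseudo-task regularizer (a sketch in my own notation; the paper's exact expression is not reproduced here) is the empirical loss of the shared network on the K auxiliary tasks:

Ω(θ) = λ Σ_{k=1}^{K} Σ_n ℓ( g_k(x_n; θ), y_n^k )

where each auxiliary output g_k shares the hidden layers with the main task, so fitting the pseudo-tasks regularizes the shared representation.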

Page 12

CNN for Transfer Learning

• Input: 140x140 pixel images, including R/G/B channels and additionally two channels Dx and Dy, which are the horizontal and vertical gradients of gray intensities

• C1 layer: 16 filters of size 16 by 16

• P1 layer: max pooling over each 5 by 5 neighborhood

• C2 layer: 256 filters of size 6 by 6, with connections of sparsity 0.5 between the 16 dimensions of the P1 layer and the 256 dimensions of the C2 layer

• P2 layer: max pooling over each 5 by 5 neighborhood

• Output layer: full connections between the (256 by 4 by 4) P2 features and the outputs
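As a consistency check (assuming 'valid' convolutions and non-overlapping pooling, which the slide does not state explicitly), the quoted sizes line up:

140x140 input  →  C1 (16x16 filters): 140 − 16 + 1 = 125  →  P1 (5x5 pooling): 125 / 5 = 25
25x25          →  C2 (6x6 filters):   25 − 6 + 1 = 20     →  P2 (5x5 pooling): 20 / 5 = 4

which matches the 256 by 4 by 4 block of P2 features feeding the output layer.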

Page 13

Generating Pseudo Tasks

1. The pseudo-task is constructed by sampling a random 2D patch and using it as a template to form a local 2D filter that operates on every training image. The value assigned to an image under this task is taken to be the maximum over the result of this 2D convolution operation (a code sketch of this step follows the list)

2. Pseudo-tasks built directly from raw image patches in this way are brittle to scale, translation, and slight intensity variations
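A small sketch of the computation in point 1, using cross-correlation (template matching) for the 'convolution' step; the function and variable names below are hypothetical, not taken from the paper.

import numpy as np

def pseudo_task_value(image, patch):
    # Slide the sampled patch over every valid position of the image and
    # return the maximum filter response; this scalar is the value the
    # pseudo-task assigns to the image.
    ph, pw = patch.shape
    ih, iw = image.shape
    best = -np.inf
    for r in range(ih - ph + 1):
        for c in range(iw - pw + 1):
            best = max(best, float(np.sum(image[r:r + ph, c:c + pw] * patch)))
    return best

# Usage: sample a random 12x12 template from one image and evaluate it on
# another image to produce that image's pseudo-task target.
rng = np.random.default_rng(0)
source = rng.standard_normal((140, 140))
query = rng.standard_normal((140, 140))
r0, c0 = rng.integers(0, 140 - 12, size=2)
template = source[r0:r0 + 12, c0:c0 + 12]
print(pseudo_task_value(query, template))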

Page 14

Generating Pseudo Tasks

• Applying Gabor filters of 4 orientations and 16 scales results in 64 feature maps of size 104*104 for each image

• Max pooling is performed first within each non-overlapping 4*4 neighborhood and then within each band of two successive scales, resulting in 32 feature maps of size 26*26 for each image

• A set of K RBF filters of size 7*7 with 4 orientations is then sampled and used as the parameters of the pseudo-tasks, resulting in 8 feature maps of size 20*20

• Finally, max pooling is performed on the result across all the scales and within every non-overlapping 10*10 neighborhood, giving a 2*2 feature map which constitutes the value of the image under this pseudo-task

• This yields 4*K pseudo-tasks in total (K random patches, each operating on a different quadrant of the image)
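These sizes are again mutually consistent under non-overlapping pooling and 'valid' filtering (my assumption): 104 / 4 = 26 after the 4*4 pooling, 26 − 7 + 1 = 20 after the 7*7 RBF filters, and 20 / 10 = 2 after the final 10*10 pooling, giving the 2*2 map per pseudo-task.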

Page 15

Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields, IJCV, in press

Page 16

Results on Caltech-101

0.18 seconds for testing one image (the forward operation)

Page 17

Gender and Ethnicity Recognition

Page 18

First-layer Features

Page 19

Convergence Rate