
Introduction to Convolutional Neural Network with TensorFlow

Page 1

Google confidential | Do not distribute

Introduction to Convolutional Neural Network with TensorFlow

Etsuji Nakai, Cloud Solutions Architect at Google, 2017/03/24 ver1.0

Page 2

Background & Objective

Page 3

● What's happening here?!

Image Classification Transfer Learning with Inception v3

https://codelabs.developers.google.com/codelabs/cpb102-txf-learning

Page 4

● Let's study the underlying mechanism with this (relatively) simple CNN.

Convolutional Neural Network with Two Convolution Layers

[Diagram: raw image → convolution filters → pooling layer → convolution filters → pooling layer → fully-connected layer → dropout layer → softmax function]

Page 5

● Launch Cloud Datalab.
○ https://cloud.google.com/datalab/docs/quickstarts

● Open a new notebook and execute the following command.
○ !git clone https://github.com/enakai00/cnn_introduction.git

● Find the notebook files in the "cnn_introduction" folder.

Jupyter Notebooks

Page 6

Logistic Regression

Page 7

● Training Set:
○ N data points on the (x, y) plane.
○ The data points belong to two categories, labeled as t = 1, 0.

● Problem to solve:
○ Find a straight line that classifies the given data.
○ If there is no perfect answer (one without any misclassification), find an optimal one in some sense.

Sample Problem

Page 8

● Define the straight line as below.
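(The equation image is missing from this transcript. Consistent with the TensorFlow code on page 17, the line is presumably f(x, y) = w0 + w1 x + w2 y = 0, with parameters w = (w0, w1, w2).)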

● We apply the maximum likelihood method to determine the parameter w.

● In other words, we will define a "probability to obtain the training set", and maximize it.

Logistic Regression: Theoretical Ground

[Figure: training data on the (x, y) plane, separated by a straight line]

Page 9

● The probability P(x, y) of t = 1 for a new data point at (x, y) should have the following properties.
○ P(x, y) = 0.5 on the separation line.
○ P(x, y) approaches 1 (or 0) as the point moves away from the separation line.

● This can be satisfied by translating f(x, y) into a probability through the logistic sigmoid function σ(a).
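(The definition image is missing here. The standard form, consistent with tf.sigmoid used on page 17, is σ(a) = 1 / (1 + e^(-a)), giving the probability P(x, y) = σ(f(x, y)).)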

Logistic Sigmoid Function

[Figure: the sigmoid surface over the (x, y) plane; P(x, y) increases in the direction normal to the separation line]

Page 10

● Using the probability defined on the previous page, calculate the probability of reproducing the training set {(x_n, y_n, t_n)}.

○ If t_n = 1, the probability of observing it at (x_n, y_n) is P_n = P(x_n, y_n).
○ If t_n = 0, the probability of observing it at (x_n, y_n) is 1 - P_n.
○ These results can be expressed by a single equation as below. (Remember that x^0 = 1 for any x.)

  P_n^(t_n) (1 - P_n)^(1 - t_n)

● Hence, the total probability of reproducing all data (the likelihood function) is expressed as:

  P = Π_n P_n^(t_n) (1 - P_n)^(1 - t_n)

Likelihood Function of Logistic Regression

Page 11

● Instead of maximizing the likelihood function, we generally minimize the following loss function, which avoids numerical underflow.
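(The formula image is missing here. Taking the negative logarithm of the likelihood P from the previous page gives E(w) = -Σ_n { t_n log P_n + (1 - t_n) log(1 - P_n) }, which matches the TensorFlow code on page 19.)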

Loss Function

Page 12

Gradient Descent Optimization

● By incrementally modifying the parameters in the direction opposite to the gradient vector, we may eventually reach the minimum.
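(The update-rule image is missing here; the standard gradient descent step is w_new = w - ε ∇E(w), where ε is the learning rate introduced on the next page.)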

Page 13

Learning Rate and Convergence Issue

● Learning rate ε decides the "step size" of each modification.

● The convergence of the optimization depends on the learning rate value.

[Figure: learning-rate behavior: one optimization path converges, another diverges]

http://sebastianruder.com/optimizing-gradient-descent/

Page 14

TensorFlow Programming

Page 15

Programming Style of TensorFlow

● All data is represented by "multidimensional lists".
○ In many cases, you can use a two-dimensional list, which is equivalent to a matrix. So by expressing models (functions) in matrix form, you can translate them into TensorFlow code.

● As a concrete example, we will write the following model (functions) in TensorFlow code.

○ Pay attention to distinguishing the following three objects.

■ Placeholder: a variable to store training data.
■ Variable: parameters to be adjusted by the training algorithm.
■ Functions constructed from Placeholders and Variables.

Page 16

Programming Style of TensorFlow

● The linear function representing the straight line can be expressed using matrices as below. (The equation images are missing from this transcript; the forms below are reconstructed from the code on page 17.)

  f(x, y) = x w + w0,  where x = (x, y) and w = (w1, w2)^T

● x should in general be treated as a Placeholder which holds multiple data points simultaneously. So let x_n = (x_n, y_n) represent the n-th data point, and using the matrix X which holds all the data points, you can write down the following matrix equation:

  F = X w + w0

○ Here the n-th element of F corresponds to the value of f for the n-th data point, and the "broadcast rule" is applied to the last part "+ w0". This means adding w0 to all matrix elements.
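A minimal sketch of this broadcast behavior, using NumPy for illustration (not part of the original slides; the data values are hypothetical):

import numpy as np

# Three data points (x_n, y_n) stacked into the matrix X.
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
# Weights (w_1, w_2) as a column vector, plus the scalar w_0.
w = np.array([[0.5],
              [-0.5]])
w0 = 1.0

# w0 is broadcast, i.e. added to every element of Xw.
F = np.dot(X, w) + w0
print(F)   # [[0.5], [0.5], [0.5]]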

Page 17

Programming Style of TensorFlow

● Finally, by applying the sigmoid function σ to each element of F, the probability for each data point is calculated: P = σ(F).

○ The "broadcast rule" is applied to σ(F), meaning σ is applied to each element of F.

● These relationships are expressed in TensorFlow code as below.

x = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.zeros([2, 1]))
w0 = tf.Variable(tf.zeros([1]))
f = tf.matmul(x, w) + w0
p = tf.sigmoid(f)

Page 18

Programming Style of TensorFlow

● This explains the relationship between the matrix calculations and the TensorFlow code.

x = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.zeros([2, 1]))
w0 = tf.Variable(tf.zeros([1]))
f = tf.matmul(x, w) + w0
p = tf.sigmoid(f)

○ The Placeholder x stores the training data. Its matrix size is [None, 2]: the row size should be None so that it can hold an arbitrary number of data points.
○ The Variables w and w0 represent the parameters to be trained (initialized to 0 here).
○ The "broadcast rule" (similar to NumPy arrays) is applied to the calculations.

Page 19

Error Function and Training Algorithm

● To train the model (i.e. to adjust the parameters), we need to define the error function and the training algorithm.

t = tf.placeholder(tf.float32, [None, 1])

loss = -tf.reduce_sum(t*tf.log(p) + (1-t)*tf.log(1-p))

train_step = tf.train.AdamOptimizer().minimize(loss)

○ tf.reduce_sum adds up all matrix elements.
○ The Adam optimizer is used to minimize "loss".

Page 20

Calculations inside Session

● The TensorFlow code we have prepared so far just defines functions and their relations, without performing any calculation. We prepare a "Session", and the actual calculations are executed inside that session.

[Diagram: Placeholders and Variables feed the calculations, which are executed inside the Session]

Page 21

Using Session to Train the Model

● Create a new session and initialize Variables inside the session.

● By evaluating the training algorithm inside the session, the Variables are adjusted with the gradient descent method.
○ "feed_dict" specifies the data to be stored in the Placeholders.
○ When functions are evaluated in the session, the corresponding values are calculated using the current values of the Variables.

sess = tf.Session()
sess.run(tf.initialize_all_variables())

i = 0
for _ in range(20000):
    i += 1
    sess.run(train_step, feed_dict={x:train_x, t:train_t})
    if i % 2000 == 0:
        loss_val, acc_val = sess.run(
            [loss, accuracy], feed_dict={x:train_x, t:train_t})
        print ('Step: %d, Loss: %f, Accuracy: %f' % (i, loss_val, acc_val))

○ The gradient descent method is applied using the training data specified by feed_dict.
○ "loss" and "accuracy" are calculated using the current values of the Variables.

Page 22

Exercise

● Run through the Notebook:
○ No.1 Tensorflow Programming

Page 23

Linear Multicategory Classifier

Page 24

● Logistic regression gives the "probability of being classified as t = 1" for each data point in the training set.

● The parameters are adjusted to minimize the error function (the loss function defined on page 11):

  E(w) = -Σ_n { t_n log P_n + (1 - t_n) log(1 - P_n) }

Recap: Logistic Regression

[Figure: P(x, y) increases in the direction normal to the separation line]

Page 25

● Drawing the three-dimensional graph of z = f(x, y), we can see that the "tilted plate" divides the (x, y) plane into two classes.

● The logistic function σ translates the height on the plate into the probability of t = 1.

Graphical Interpretation of Logistic Regression

[Figure: the tilted plane z = f(x, y), translated into a probability by the logistic function σ]

Page 26

● How can we divide the plane into three classes (instead of two)?

● We can define three linear functions and classify a point based on "which of them has the maximum value at that point."

○ This is equivalent to dividing the plane with three tilted plates.
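(The function images are missing here. Consistent with the matrix form on page 30, the three functions are presumably f_i(x, y) = w_i0 + w_i1 x + w_i2 y for i = 1, 2, 3, and a point is assigned to the class whose f_i takes the largest value at that point.)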

Building Multicategory Linear Classifier

Page 27

● We can define the probability that a point (x, y) belongs to the i-th class with the following softmax function:

  P_i(x, y) = e^(f_i(x, y)) / Σ_j e^(f_j(x, y))

● This translates the magnitude of f_i(x, y) into a probability satisfying the following (reasonable) conditions:
○ 0 ≤ P_i ≤ 1
○ Σ_i P_i = 1
○ P_i > P_j whenever f_i(x, y) > f_j(x, y)

Translation to Probability with Softmax function

One-dimensional example of the "softmax translation."
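(A hypothetical worked example, not from the original slides: for f = (1, 2, 3), the softmax gives P ≈ (0.09, 0.24, 0.67); the largest f_i receives the largest probability, and the values sum to 1.)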

Page 28

Image Classification with Linear Multicategory Classifier

Page 29

● A grayscale image with 28x28 pixels can be represented as a 784-dimensional vector, which is a collection of 784 float numbers.
○ In other words, it corresponds to a single point in a 784-dimensional space!

Images as Points in High Dimensional Space

● When we spread a bunch of images into this 784-dimensional space, similar images may come together to form clusters of images.
○ If this assumption is correct, we can classify the images by dividing the 784-dimensional space with the softmax function.
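A minimal sketch of this flattening, using NumPy for illustration (not part of the original slides):

import numpy as np

# A 28x28 grayscale image as a two-dimensional array.
img = np.zeros((28, 28), dtype=np.float32)

# The same image as a point in 784-dimensional space.
vec = img.reshape(784)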

Page 30

Matrix Representation

● To divide an M-dimensional space into K classes, we prepare K linear functions. (The equation images are missing here; the forms below are reconstructed from the code on page 33.)

  f_k(x_1, ..., x_M) = w_1k x_1 + ... + w_Mk x_M + w_0k   (k = 1, ..., K)

● Defining the n-th image data as x_n = (x_n1, ..., x_nM), the values of the linear functions for all data can be represented as below. (The broadcast rule is applied to the "+ w" operation.)

  F = X W + w

  Here X is the N x M matrix which stacks all the data points, W is the M x K matrix of weight parameters, and w = (w_01, ..., w_0K).

Page 31

Matrix Representation

● Here is the summary of the matrix representation.

[Matrix diagram: F = X W + w, with the broadcast rule applied to w]

Page 32

Matrix Representation

● Finally, we can translate the result into probabilities by applying the softmax function. The probability that the n-th data point is classified into the k-th category is:

  P_nk = e^(F_nk) / Σ_k' e^(F_nk')

● TensorFlow has the "tf.nn.softmax" function, which calculates these probabilities directly from the matrix F.

Page 33

TensorFlow Codes of the Model

● The matrix representations we built so far can be written in TensorFlow code as below.
○ Pay attention to the difference between the Placeholder and the Variables.

x = tf.placeholder(tf.float32, [None, 784])
w = tf.Variable(tf.zeros([784, 10]))
w0 = tf.Variable(tf.zeros([10]))
f = tf.matmul(x, w) + w0
p = tf.nn.softmax(f)

Page 34

Loss Function

● The class label of the n-th data point is given by a vector t_n in the one-of-K representation: it has 1 only for the k-th element, meaning its class is k.

● Since the probability of having the correct answer for this data point is P_nk, the probability of having correct answers for all data is calculated as below. (The product over k' picks out only k' = k, the class of the n-th data point, because t_nk' = 1 only there.)

  P = Π_n Π_k' P_nk'^(t_nk')

● We define the loss function as below; minimizing the loss function is then equivalent to maximizing the probability P.

  E = -log P = -Σ_n Σ_k' t_nk' log P_nk'

Page 35

TensorFlow Codes for Loss Function

● The loss function and the optimization algorithm can be written in TensorFlow code as below.

● The code after them calculates the accuracy of the model.
○ "correct_prediction" is a list of bool values indicating "correct or incorrect."
○ "accuracy" is calculated by taking the mean of those bool values (1 for correct, 0 for incorrect).

t = tf.placeholder(tf.float32, [None, 10])

loss = -tf.reduce_sum(t * tf.log(p))

train_step = tf.train.AdamOptimizer().minimize(loss)

correct_prediction = tf.equal(tf.argmax(p, 1), tf.argmax(t, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
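(A hypothetical worked example, not from the original slides: if a row of p is (0.1, 0.7, 0.2), tf.argmax(p, 1) returns 1 for that row; if the corresponding row of t is (0, 1, 0), tf.argmax(t, 1) also returns 1, so the prediction counts as correct and contributes 1.0 to the mean.)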

Page 36

Comparing Predictions and Class Labels

● The following shows how we calculate the correctness of predictions.

[Diagram: the predicted class is the index of the maximum probability in each row of p; it is compared with the correct class indicated by the corresponding row of the label matrix t]

Page 37

Mini-Batch Optimization of Parameters

● We repeat the optimization operations using 100 samples at a time.

i = 0
for _ in range(2000):
    i += 1
    batch_xs, batch_ts = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, t: batch_ts})
    if i % 100 == 0:
        loss_val, acc_val = sess.run([loss, accuracy],
            feed_dict={x:mnist.test.images, t: mnist.test.labels})
        print ('Step: %d, Loss: %f, Accuracy: %f' % (i, loss_val, acc_val))

[Diagram: successive batches of 100 samples (batch_xs, batch_ts) are drawn from the image and label data, and an optimization step runs on each batch]

Page 38

Mini-Batch Optimization of Parameters

● Mini-batch optimization has the following advantages.
○ It reduces the memory usage.
○ Its random movement helps avoid being trapped in local minima.

[Figure: the simple gradient descent method using all training data at once can be trapped in a local minimum, while stochastic gradient descent with the mini-batch method can escape it and reach the true minimum]

Page 39

Exercise

[Figure: examples of correct and incorrect classification results]

● Run through the Notebook:
○ No.2 Softmax classifier for MNIST

Page 40

Basic Strategy of Convolutional Network

Page 41

● The linear categorizer assumes that samples can be classified with flat planes.

● This assumption cannot be perfect, and it fails to capture the global (topological) features of handwritten digits.

The limitation of Linear Categorizer

[Figure: correct and incorrect examples from the results of the linear classifier]

Page 42

● The convolutional neural network (CNN) uses image filters to extract features from images and applies hidden layers to classify them.

The Overview of Convolutional Neural Network

[Diagram: the same two-convolution-layer architecture as page 4: raw image → convolution filters → pooling layer → convolution filters → pooling layer → fully-connected layer → dropout layer → softmax function]

Page 43

● Convolutional filters are ... just image filters of the kind you sometimes apply in Photoshop!

Examples of Convolutional Filters

[Figure: a filter that blurs images and a filter that extracts vertical edges]

Page 44

● To classify the following training set, what would be the best filters?

Question

Page 45

● Apply image filters to capture various features of the image.
○ For example, if we want to classify the three characters "+", "-", "|", we can apply filters that extract vertical and horizontal edges as below.

● Apply the pooling layer to (deliberately) reduce the image resolution.
○ The information necessary for classification is just the density of the filtered image.

How Convolutional Neural Network Works

Page 46

# Imports added for completeness (assumed from the notebook context).
import numpy as np
import tensorflow as tf

def edge_filter():
    # Filter that responds to vertical edges.
    filter0 = np.array(
        [[ 2, 1, 0,-1,-2],
         [ 3, 2, 0,-2,-3],
         [ 4, 3, 0,-3,-4],
         [ 3, 2, 0,-2,-3],
         [ 2, 1, 0,-1,-2]]) / 23.0
    # Filter that responds to horizontal edges.
    filter1 = np.array(
        [[ 2, 3, 4, 3, 2],
         [ 1, 2, 3, 2, 1],
         [ 0, 0, 0, 0, 0],
         [-1,-2,-3,-2,-1],
         [-2,-3,-4,-3,-2]]) / 23.0
    # Pack the two 5x5 single-channel filters into shape [5, 5, 1, 2].
    filter_array = np.zeros([5,5,1,2])
    filter_array[:,:,0,0] = filter0
    filter_array[:,:,0,1] = filter1
    return tf.constant(filter_array, dtype=tf.float32)

TensorFlow code to apply the filters

x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1,28,28,1])

W_conv = edge_filter()
h_conv = tf.abs(tf.nn.conv2d(x_image, W_conv,
                             strides=[1,1,1,1], padding='SAME'))
h_conv_cutoff = tf.nn.relu(h_conv-0.2)

h_pool = tf.nn.max_pool(h_conv_cutoff, ksize=[1,2,2,1],
                        strides=[1,2,2,1], padding='SAME')

Page 47

● In this model, we use pre-defined (fixed) filters to capture vertical and horizontal edges.

● Question: How can we choose appropriate filters for more general images?

Simple Model to Classify "+", "-", "|".

[Diagram: input image → convolution filter → pooling layer → softmax]

Page 48

Exercise

● Run through the Notebook:
○ No.3 Convolutional Filter Example
○ No.4 Toy model with static filters

Page 49

Dynamic Optimization of Convolution Filters

Page 50

● In the convolutional neural network, we define the filters as a Variable; the optimization algorithm then tries to adjust the filter values to achieve better predictions.
○ The following code applies 16 filters to images with 28x28 pixels (= 784-dimensional vectors).

Dynamic Optimization of Filters

num_filters = 16

# Placeholder to store images
x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1,28,28,1])

# Define filters as Variables
W_conv = tf.Variable(tf.truncated_normal([5,5,1,num_filters], stddev=0.1))

# Apply filters and the pooling layer
h_conv = tf.nn.conv2d(x_image, W_conv, strides=[1,1,1,1], padding='SAME')
h_pool = tf.nn.max_pool(h_conv, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

Page 51

Exercise

● Run through the Notebook:
○ No.5 Single layer CNN for MNIST
○ No.6 Single layer CNN for MNIST result

(Since the filtered images contain negative pixel values, the backgrounds of the images are not necessarily white.)

Page 52

● By adding more filter (and pooling) layers, we can build a multi-layer CNN.
○ Filters in different layers are believed to recognize different kinds of features, but the details are still under study.
○ The dropout layer is used to avoid overfitting by randomly cutting part of the connections during training.

Multi-layer Convolutional Neural Network

[Diagram: the multi-layer CNN architecture (the same diagram as page 4)]

Page 53

● Run through the Notebook:
○ No.7 CNN Handwriting Recognizer

Exercise

[Screenshots: the images after passing through the second filter layer, and the prediction of a handwritten number]

Page 54

Neural Network Basics

Page 55

Single Layer Neural Network

● This is an example of a single layer neural network.
○ The two nodes in the hidden layer transform the value of a linear function with the activation function.
○ There are several choices for the activation function. We will use the hyperbolic tangent in the following examples.

[Figure: candidate activation functions (logistic sigmoid, hyperbolic tangent, ReLU), and a network diagram with a hidden layer and an output layer]

Page 56

Single Layer Neural Network

● Since the output of the hyperbolic tangent changes quickly from -1 to 1, the outputs of the hidden layer effectively split the input space into discrete regions with straight lines.
○ In this example, the (x, y) plane is split into 4 regions.

[Figure: the (x, y) plane split into four regions ① ② ③ ④]

Page 57

Single Layer Neural Network

● Since the logistic sigmoid in the output node can classify the plane with a straight line, this single layer network can classify the 4 regions into two classes as below.

[Figure: the four regions grouped into two classes]

Page 58

Limitation of Single Layer Network

● On the other hand, this neural network cannot classify data in the following pattern.
○ How can you extend the network to cope with this data?

[Figure: a pattern of regions that cannot be classified with a straight line]

Page 59

Neural Network as Logical Units

● A single node (consisting of a linear function and an activation function) works as a logical unit for AND or OR as below.
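(A hypothetical worked example, not from the original slides: with inputs x1, x2 ∈ {0, 1} and a steep sigmoid-like activation, z = σ(x1 + x2 - 1.5) fires only when both inputs are 1, acting as AND, while z = σ(x1 + x2 - 0.5) fires when at least one input is 1, acting as OR.)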

Page 60

Neural Network as Logical Units

● Since the previous pattern is equivalent to XOR, we can combine the AND and OR units to make an XOR unit. As a result, the following "enhanced output node" can classify the previous pattern.

[Diagram: hidden nodes acting as OR and AND units are combined by the enhanced output node into an XOR unit]
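(In terms of logical operations, this is the standard construction XOR(a, b) = AND(OR(a, b), NOT(AND(a, b))): an output node that weights the OR unit positively and the AND unit negatively, with a suitable bias, reproduces XOR.)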

Page 61

Neural Network as Logical Units

● Combining the hidden layer and the "enhanced output unit" results in the following 2-layer neural network.
○ The first hidden layer extracts features as a combination of binary variables, and the second hidden layer plus the output node classify them as an XOR logical unit.

[Diagram: the first hidden layer extracts features; the second hidden layer and the output node classify them with an XOR logical unit]

Page 62

Exercise

● You can see the actual result on the Neural Network Playground.

○ http://goo.gl/VIvOaQ

Page 63

Thank you!
