Deep Learning Introduction
April 10, 2019
http://cross-entropy.net/ML410/Deep_Learning_0.pdf
Agenda for Tonight
• About the Class
• About the Textbooks
• Chapter 1: Deep Learning with Python
• Chapter 1: Introduction to Deep Learning
• Homework
Course Topics
• Forward Propagation of Features and Backward Propagation of Loss for Machine Learning
• Tensorflow Basics
• Keras Basics
• Multi-Layer Perceptron models
• Convolution: Filters, Strides, and Padding
• Available Keras Models for Image Classification, including the Residual Network (ResNet) model and the Dense Network (DenseNet) model
• Word Embeddings
• Recurrent Neural Networks, including Long Short-Term Memory cells and Gated Recurrent Unit cells
• Sequence-to-Sequence Models
• Transformers and Bidirectional Encoder Representations from Transformers
• Auto-Encoders
• Generative Adversarial Networks
• Deep Reinforcement Learning
• Best Practices
Technical Requirements
• We'll be using the Tensorflow and Keras libraries to construct and evaluate Deep Learning models. You'll need access to a browser, such as Chrome, to access the Vocareum lab environment as well as the Canvas learning management system. We'll also be using the following libraries for working with data:
• Python Imaging Library (PIL): Pillow fork
• spaCy: natural language processing library for syntactic parsing, implemented in Cython [also NLTK]
• libROSA: library for the Recognition and Organization of Speech and Audio
Student Assessment
Weekly assignments will include 14 questions from our textbook as well as 9 Kaggle tasks. Homework questions from the textbook will be worth 1 point each. Kaggle tasks will be worth 4 points each. For the Kaggle tasks, you must beat a "baseline" method to receive credit. You need to attend at least 8 class sessions and receive a total of at least 25 points to pass the course.
Textbook #1: Deep Learning with Python [DLP]
1. What is deep learning?
2. Before we begin: the mathematical building blocks of neural networks
3. Getting started with neural networks
4. Fundamentals of machine learning
5. Deep learning for computer vision
6. Deep learning for text and sequences
7. Advanced deep-learning best practices
8. Generative deep learning
9. Conclusions
Textbook #2: Introduction to Deep Learning [IDL]
1. Feed-Forward Neural Nets
2. Tensorflow
3. Convolutional Neural Networks
4. Word Embeddings and Recurrent Neural Networks
5. Sequence-to-Sequence Learning
6. Deep Reinforcement Learning
7. Unsupervised Neural-Network Models
Textbook #1
The cover of our textbook is captioned “Habit of a Persian Lady in 1568”, from Thomas Jefferys’ book, “A Collection of the Dresses of Different Nations”
[DLP] Chapter 1: What is Deep Learning?
1. Artificial Intelligence, Machine Learning, and Deep Learning
a. Artificial Intelligence
b. Machine Learning
c. Learning Representations from Data
d. The “Deep” in Deep Learning
e. Understanding How Deep Learning Works, in Three Figures
f. What Deep Learning Has Achieved So Far
g. Don’t Believe the Short-Term Hype
h. The Promise of AI
2. Before Deep Learning: a Brief History of Machine Learning
a. Probabilistic Modeling
b. Early Neural Networks
c. Kernel Methods
d. Decision Trees, Random Forests, and Gradient Boosting Machines
e. Back to Neural Networks
f. What Makes Deep Learning Different
g. The Modern Machine Learning Landscape
[DLP] Chapter 1: What is Deep Learning?
3. Why Deep Learning? Why Now?
a. Hardware
b. Data
c. Algorithms
d. A New Wave of Investment
e. The Democratization of Deep Learning
f. Will it Last?
• This chapter covers:
• High-level definitions of fundamental concepts
• Timeline of the development of machine learning
• Key factors behind deep learning’s rising popularity and future potential
Artificial Intelligence
• Concise definition: the effort to automate intellectual tasks normally performed by humans
• Initial take: expert rules
• Fine for chess
• Difficult to develop rules for image classification, speech recognition, or language translation
Relationships Between AI, ML, and DL
• Expert Rules
• Linear Regression
• Logistic Regression
• Random Forests
• Gradient Boosting
• Multi-Layer Perceptron Network
• Convolutional Neural Networks
• Recurrent Neural Networks
Expert Rules Example
% Data: fruit(X) :- attributes(Y)
fruit(banana) :- colour(yellow), shape(crescent).
fruit(apple) :- (colour(green); colour(red)), shape(sphere), stem(yes).
fruit(lemon) :- colour(yellow), (shape(sphere);shape('tapered sphere')), acidic(yes).
fruit(lime) :- colour(green), shape(sphere), acidic(yes).
fruit(pear) :- colour(green), shape('tapered sphere').
fruit(plum) :- colour(purple), shape(sphere), stone(yes).
fruit(grape) :- (colour(purple);colour(green)), shape(sphere).
fruit(orange) :- colour(orange), shape(sphere).
fruit(satsuma) :- colour(orange), shape('flat sphere').
fruit(peach) :- colour(peach).
fruit(rhubarb) :- (colour(red); colour(green)), shape(stick).
fruit(cherry) :- colour(red), shape(sphere), stem(yes), stone(yes).
What is the value for colour?
[red, orange, yellow, green, purple, peach]
green
What is the value for shape?
[sphere, crescent, tapered sphere, flat sphere, stick]
stick
The fruit is rhubarb
http://www.paulbrownmagic.com/blog/simple_prolog_expert
Machine Learning
• Ada Lovelace, 1843: “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.”
• Alan Turing, 1950: quoting Ada Lovelace, while pondering whether general-purpose computers could be capable of learning
trained versus programmed
Learning Representations from Data
• Need three things (for supervised learning):
• Input data points: structured data, image files, sound files, text documents
• Examples of expected output
• Way to measure whether the algorithm is doing a good job
• Input representation examples:
• Image as Red, Green, and Blue picture element (pixel) values
• Image as Hue, Saturation, and Value
https://en.wikipedia.org/wiki/HSL_and_HSV
Example Data
• The inputs are the coordinates of our points
• The expected outputs are the colors of the points
• A way to measure whether our algorithm is doing a good job could be the percentage of points that are being classified correctly (accuracy)
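For instance, a minimal NumPy sketch of that accuracy measure (the predicted and actual labels below are made up):

import numpy as np
predicted = np.array(["white", "black", "black", "white"])
actual    = np.array(["white", "black", "white", "white"])
accuracy = np.mean(predicted == actual)   # 0.75: three of the four points are classified correctly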
New Representation
import numpy as np
# Input holds one (x, y) coordinate pair per row; the values below are made-up examples
Input = np.array([[1.0, 3.0], [3.0, 1.0], [2.0, 4.0]])
translation = np.array([-2, -2])                      # shift the points
theta = 0.25 * np.pi                                  # rotate by 45 degrees
Rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
np.dot(Input + translation, Rotation)                 # the new representation of the points
Deep Neural Network for Digit Classification
• Successive layers of increasingly meaningful representations
• Alternative names for deep learning:
• Layered representations learning
• Hierarchical representations learning
Deep Representations Learned by a Digit-Classification Model
How Deep Learning Works: Part 1 of 3
Neural Network Parameterized by its Weights
How Deep Learning Works: Part 2 of 3
Loss Function Measures Quality of Network’s Output
How Deep Learning Works: Part 3 of 3
Loss Score Used as Feedback Signal to Adjust Weights
Deep Learning Achievements
• Near-human-level image classification
• Near-human-level speech recognition
• Near-human-level handwriting transcription
• Improved machine translation
• Improved text-to-speech conversion
• Digital assistants such as Google Now and Amazon Alexa
• Near-human-level autonomous driving
• Improved ad targeting, as used by Google, Baidu, and Bing
• Improved search results on the web
• Ability to answer natural-language questions
• Superhuman Go playing
Deep Learning Hype
• Although some world-changing applications like autonomous cars are already within reach, many more are likely to remain elusive for a long time, such as believable dialogue systems, human-level machine translation across arbitrary languages, and human-level natural language understanding
• Previous AI “Winters”:
1. XOR (eXclusive OR): my perception is that the inability of the perceptron to solve this problem cast a shadow on AI (though this was understood at the time)
2. By the early 90s, rule-based systems had proven expensive to maintain, difficult to scale, and limited in scope
• We are currently in the intense optimism phase of a new cycle
Promise of AI
• Most of the research findings of deep learning aren’t yet applied to the full range of problems they can solve across industries
• “Your doctor doesn’t use AI, and neither does your accountant” [I thought I uploaded an image of a document during tax time]
• “Back in 1995, it would have been difficult to believe in the future impact of the internet”
• “In a not-so-distant future, AI will be your assistant; it will answer your questions, help educate your kids, and watch over your health. It will deliver your groceries to your door and drive you from point A to point B.”
• Don’t believe the short-term hype, but do believe in the long-term vision
History: Probabilistic Modeling
• Naïve Bayes
• Logistic Regression
Early Neural Networks
• 1950s: Perceptron
• 1980s: Backpropagation
• Late 1980s: Yann LeCun’s work on handwritten digit recognition (the work that later gave us the MNIST benchmark)
History: Kernel Methods
• Example kernel method for classification
History: Decision Tree Ensembles
• Parameters that are learned are questions about the data, e.g. “Is feature2 in the data greater than 3.5?”
History: Back to Neural Networks
• AlexNet was not the first fast GPU-implementation of a CNN to win an image recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU.[6] A deep CNN of Dan Ciresan et al. (2011) at IDSIA was already 60 times faster[7] and achieved superhuman performance in August 2011.[8] Between May 15, 2011 and September 10, 2012, their CNN won no less than four image competitions.[9][10] They also significantly improved on the best performance in the literature for multiple image databases.[11]
• According to the AlexNet paper,[5] Ciresan's earlier net is "somewhat similar." Both were originally written with CUDA to run with GPU support. In fact, both are actually just variants of the CNN designs introduced by Yann LeCun et al. (1989)
https://en.wikipedia.org/wiki/AlexNet
Reasons for Deep Learning Success
• Incremental layer-by-layer way in which increasingly complex representations are developed
• Fact that these intermediate incremental representations are learned jointly
Why Deep Learning? Why Now?
• Three technical forces driving advances in machine learning• Hardware
• Datasets and benchmarks
• Algorithmic advances
• Following a scientific revolution, progress generally follows a sigmoid curve: it starts with a period of fast progress, which generally stabilizes as researchers hit hard limitations, and then, further improvements become incremental
Models to Try
• Deep Learning should be viewed as another tool in the toolbox
• There are many possible machine learning methods to apply
• Suggestion is to try …
• Linear models
• Tree-based ensembles; e.g. random forests and gradient boosting
• Deep learning; e.g. feedforward, convolutional, and recurrent networks
• Gradient boosting and deep learning have won a lot of Kaggle competitions
Textbook #2
The cover of our textbook pays homage to the Modified National Institute of Standards and Technology (MNIST) data set, which is what led to the development of a convolutional neural network [learning the filters to classify handwritten zip codes, instead of handcrafting filters]
About the Textbook
• Reads like a rough draft. Example: equation 1.21 in the first printing is incorrect [change the X (uppercase phi) to l_j]
• From the preface: “So I did what any self-respecting professor would do, scheduled myself to teach the stuff, started a crash course by surfing the web, and got my students to teach it to me”
• Put down the torches and pitchforks: this book does provide a decent introduction to the theory behind deep learning
[IDL] Chapter 1: Feed-Forward Neural Nets
1. Perceptrons
2. Cross-entropy Loss Functions for Neural Nets
3. Derivatives and Stochastic Gradient Descent
4. Writing Our Program
5. Matrix Representation of Neural Nets
6. Data Independence
7. References and Further Reading
8. Written Exercises
MNIST Digit as a Matrix
MNIST Digit as an Image
image looks funky because we’ve zoomed in on a low-resolution (28 x 28) image
Schematic Diagram of a Perceptron
A Typical Neuron
• A single biological neuron typically has …
• many inputs (dendrites)
• a cell body
• a single output (the axon)
• Neural networks are inspired by the biological neuron (not an accurate representation of how learning is performed by humans)
The Perceptron Function
The perceptron function is parameterized by the vector Φ, where 𝜙𝑖 ∈ Φ is the ith parameter [“fee sub eye” is an element of “cap fee (capital fee)”]
b is a bias term
w is a weight vector
The capital sigma (Σ) is a summation operator, with index i.
Nota Bene (note well): a neuron is a weight vector
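A minimal NumPy sketch of the threshold form described above (my own rendering, not the book’s code):

import numpy as np
def perceptron(x, w, b):
    # fire (output 1) when the dot product of the weights and inputs, plus the bias, exceeds zero
    return 1 if np.dot(w, x) + b > 0 else 0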
The Dot Product
The dot product is the inner product of two vectors.
The dot product is the numerator of the cosine of the angle between the two vectors. It can be viewed as measuring the similarity between the two vectors. A large positive number means the two vectors are similar, while a large negative number means the two vectors are dissimilar. Normalizing (dividing) the dot product by the product of the two vector lengths yields a number in the interval [-1, 1]. Taking the dot product of an input vector and a weight vector yields a new “feature”, measuring how similar the input vector is to the weight vector. This is why deep learning is sometimes referred to as learning representations [as in the International Conference on Learning Representations (ICLR); the other important Deep Learning conference is the Neural Information Processing Systems conference (NeurIPS)].
This is *the* most common operation in deep learning. We’ve peaked early. It’s all downhill from here ☺
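A small NumPy illustration of the dot product and its normalized (cosine) form, using made-up vectors:

import numpy as np
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
dot = np.dot(u, v)                                      # 28.0
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # 1.0, because v points in the same direction as u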
Input Example
𝒙^k = (x_1^k, …, x_l^k)
𝒙𝑘 identifies the input vector for example “k”, with features 1 through “l” (lower-case “l”)
I would not have chosen lower-case “l” to represent the length of the input feature vector, but I want to be consistent with the book [which is a decent textbook for deep learning].
Note: a lower-case bold letter often represents a vector, while an upper-case bold letter often represents a matrix.
𝑎𝑘 identifies the output answer for example “k”
The Perceptron Learning Algorithm
The capital delta (Δ) represents the quantity to be added to the weight.
We are treating the bias b as a weight for the imaginary feature 𝑥0 = 1.
N is our first hyperparameter, a value assigned by us to control the learning algorithm’s behavior.
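A rough NumPy sketch of the algorithm, assuming binary answers in {0, 1}, inputs with a leading column of 1s for the bias, and N passes over the data (variable names are mine, not the book’s):

import numpy as np
def train_perceptron(X, a, N):
    # X: one example per row, first column all 1s (the imaginary feature x0 for the bias)
    # a: the answer (0 or 1) for each example; N: number of passes over the training data
    phi = np.zeros(X.shape[1])
    for _ in range(N):
        for x_k, a_k in zip(X, a):
            f = 1 if np.dot(phi, x_k) > 0 else 0
            phi += (a_k - f) * x_k    # Δ𝜙 = (answer - prediction) * input; zero when the prediction is correct
    return phi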
Multiple Perceptrons for Multiple Classes
One perceptron for each class
Neural Network Showing Layers
• First rectangle represents the 5 input features: the Input layer
• Second rectangle represents the 3 output neurons: the Perceptron layer
Weight Updates
• The weight update is equal to the negated learning rate multiplied by the partial derivative of loss with respect to the weight
• This is true for *any* weight in the network
• The number of operations (e.g. products and activations) between the weight and the output of the model dictates the number of terms that comprise the partial derivative of loss with respect to the weight
• Δ: upper-case delta [I say “cap delta”]
• 𝜙: lower-case phi [I say “fee”]
• ℒ: script L
Loss as a Function of 𝜙1
• Our goal is to minimize the loss function value (vertical axis)
• This said, we try to avoid “large” weights
Softmax
• Softmax function takes logit [I say “low-jit”] values as inputs and produces probability estimates as outputs
• Softmax is an activation function, not a layer unto itself
• “Is this a layer?”
• “Does it have weights that are applied to activations from a previous layer?”
• 𝜎: lower-case sigma
• ℯ: script ‘e’ [Euler’s number ≈ 2.71828 : a constant; not a variable]
• Σ: upper-case sigma [summation operator; not a variable]
• bold, lower-case x: a vector [bold, upper-case X: a matrix]
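A minimal NumPy sketch of the softmax activation (the shift by the largest logit is a common numerical-stability trick, not part of the definition itself):

import numpy as np
def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # subtracting the max does not change the result
    return exps / np.sum(exps)
print(softmax(np.array([2.0, 1.0, 0.1])))    # approximately [0.659 0.242 0.099]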
Cross-Entropy Loss Function
• 𝑋 : italicized ‘X’ represents the loss function
• Φ: upper-case phi [“cap fee”] represents the model parameters
• 𝑥: italicized ‘x’ represents an input vector
• ln: the natural logarithm function [Euler’s number is the base]
• 𝑎𝑥: the actual label for input vector ‘x’
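A small sketch of the loss for a single example, assuming probs holds the softmax outputs and a_x is the integer index of the actual label (made-up values):

import numpy as np
probs = np.array([0.659, 0.242, 0.099])   # softmax probability estimates
a_x = 0                                   # actual label for this input vector
loss = -np.log(probs[a_x])                # cross-entropy loss, about 0.417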
Graph of -ln x
• Reminder: a probability estimate will never be larger than 1
• -ln 0 = Infinity
• clip softmax probability estimates to avoid this
• min(max(estimate, 1e-6), 1 - 1e-6)
• 1e-6 = 1 * 10^(-6) = 0.000001
• min(max(estimate, 0.000001), 0.999999)
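In NumPy the clipping above can be written in one line (a sketch, with made-up probabilities):

import numpy as np
probs = np.array([1.0, 0.0, 0.0])        # an unclipped 0 would make -ln blow up
probs = np.clip(probs, 1e-6, 1 - 1e-6)   # min(max(estimate, 0.000001), 0.999999)
loss = -np.log(probs[1])                 # about 13.8 instead of infinity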
Relationships between Cross Entropy Loss, Softmax Activation, and Perceptron Outputs
• Cross entropy loss
• Softmax activation
• Perceptron (product) output
Chain Rule for Weight Update
Example from book is for last (only) layer
Book uses …
• X() for cross entropy loss
• lowercase ‘l’ for logit (product of input and weight vectors)
• lowercase ‘p’ for the softmax probability output

Book says / Dave says (the same chain rule, in the book’s symbols versus in words):

∂loss/∂weight = (∂product/∂weight) · (∂loss/∂product)

∂loss/∂product = (∂loss/∂activation) · (∂activation/∂product)

Putting those two equations together for the weight update:

∂loss/∂weight = (∂loss/∂activation) · (∂activation/∂product) · (∂product/∂weight)
Partial Derivatives Used for Chain Rule
The three components for updating weights in the last layer (multiply them together to derive the desired gradient) …
∂loss/∂weight = (∂loss/∂activation) · (∂activation/∂product) · (∂product/∂weight)
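As an illustration of those three components for the last layer (the standard softmax-plus-cross-entropy result, not code copied from the book; all values below are made up), ∂loss/∂product works out to the probability estimates minus the one-hot answer, and ∂product/∂weight is just the input:

import numpy as np
x = np.array([1.0, 0.5])               # input features
probs = np.array([0.7, 0.2, 0.1])      # softmax outputs for 3 classes
onehot = np.array([0.0, 1.0, 0.0])     # one-hot encoding of the actual label
grad_products = probs - onehot         # ∂loss/∂product for each output neuron
grad_W = np.outer(x, grad_products)    # ∂loss/∂weight: one column of gradients per output neuron
grad_b = grad_products                 # the bias gradients are just ∂loss/∂product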
Pseudocode for Simple Feed-Forward Digit Recognition
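The actual pseudocode is in the book; a rough, self-contained NumPy sketch of the same idea (one layer of perceptrons, softmax, cross-entropy, stochastic gradient descent; the hyperparameter values are arbitrary) might look like this:

import numpy as np

def train(images, labels, epochs=10, lr=0.01):
    # images: (num_examples, 784) rows of pixel values; labels: integer digits 0-9
    W = np.zeros((784, 10))
    b = np.zeros(10)
    for _ in range(epochs):
        for x, a in zip(images, labels):
            logits = x @ W + b
            exps = np.exp(logits - logits.max())
            probs = exps / exps.sum()            # softmax probability estimates
            grad_logits = probs.copy()
            grad_logits[a] -= 1.0                # probs minus the one-hot answer
            W -= lr * np.outer(x, grad_logits)   # stochastic gradient descent step
            b -= lr * grad_logits
    return W, b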
Matrix Manipulation
• X = Y + Z [element-wise addition]
• X = Y * Z [dot-products between rows of Y and columns of Z]
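In NumPy terms (note that * on NumPy arrays is element-wise; the row-by-column dot products come from np.dot or the @ operator):

import numpy as np
Y = np.array([[1, 2], [3, 4]])
Z = np.array([[5, 6], [7, 8]])
print(Y + Z)   # element-wise addition: [[ 6  8] [10 12]]
print(Y @ Z)   # rows of Y dotted with columns of Z: [[19 22] [43 50]]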
Quick Example
L = X * W + B
• L is the layer output
• X is the input batch
• W is the Weight matrix
• B is the Bias vector
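A shape-annotated sketch of that equation, with made-up sizes (a batch of 2 examples, 5 input features, 3 output neurons):

import numpy as np
X = np.random.rand(2, 5)   # input batch: 2 examples, 5 features each
W = np.random.rand(5, 3)   # Weight matrix: 5 features in, 3 neurons out
B = np.random.rand(3)      # one bias per neuron, broadcast across the batch
L = X @ W + B              # layer output: shape (2, 3)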
Matrix Transposition
Exchanging rows and columns
Matrix Form of Weight Update
The upside-down triangle (∇, “nabla”) is the symbol for the gradient, a multi-variable generalization of the derivative [below: it’s a vector of partial derivatives of loss with respect to the perceptron (product) output]
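A self-contained sketch of the matrix form, using random stand-in values, where grad_L plays the role of that gradient (one row of ∂loss/∂product values per example in the batch):

import numpy as np
learning_rate = 0.01
X = np.random.rand(2, 5)                    # input batch
W = np.random.rand(5, 3)                    # weights
B = np.random.rand(3)                       # biases
grad_L = np.random.rand(2, 3)               # stand-in for ∂loss/∂product, one row per example
W -= learning_rate * (X.T @ grad_L)         # the update sums the per-example contributions
B -= learning_rate * grad_L.sum(axis=0)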
Matrix Representation of Neural Networks
• We often use X for the input matrix, W for the Weight matrix, and b for the bias vector
• L represents a layer output: L = X * W + b
Data Independence
• iid assumption: data observations are independent and identically distributed
• Simply randomizing (randomly shuffling) the data presented for training can improve results
• this is handled for us by the “shuffle=True” default of Keras’ “fit()” method
• google “keras model fit shuffle” and click the “Jump to fit” link
• shuffling between epochs means the learning algorithm is unlikely to see the same batches again
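For example (a sketch; model stands for any compiled Keras model, and x_train/y_train for the training data):

model.fit(x_train, y_train, epochs=10, batch_size=32, shuffle=True)   # shuffle=True is already the default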