Deep Learning Fundamentals - Cross Entropy


Deep Learning Fundamentals, April 13, 2021

ddebarr@uw.edu

http://cross-entropy.net/ml530/Deep_Learning_1.pdf

Agenda for Tonight

• Homework Review

• [DLP] Part I: Fundamentals of Deep Learning

1. What is Deep Learning?

2. Before We Begin: the Mathematical Building Blocks of Neural Networks

3. Getting Started with Neural Networks

4. Fundamentals of Machine Learning

https://twitter.com/DeepLearningAI_/status/1310595139933548546?s=20

Deep Learning with Python

The cover of our textbook is captioned “Habit of a Persian Lady in 1568”, from Thomas Jefferys’ book, “A Collection of the Dresses of Different Nations”

[DLP] Chapter 1: What is Deep Learning?

1. Artificial Intelligence, Machine Learning, and Deep Learning

a. Artificial Intelligence

b. Machine Learning

c. Learning Representations from Data

d. The “Deep” in Deep Learning

e. Understanding How Deep Learning Works, in Three Figures

f. What Deep Learning Has Achieved So Far

g. Don’t Believe the Short-Term Hype

h. The Promise of AI

2. Before Deep Learning: a Brief History of Machine Learning

a. Probabilistic Modeling

b. Early Neural Networks

c. Kernel Methods

d. Decision Trees, Random Forests, and Gradient Boosting Machines

e. Back to Neural Networks

f. What Makes Deep Learning Different

g. The Modern Machine Learning Landscape

[DLP] Chapter 1: What is Deep Learning?

3. Why Deep Learning? Why Now?

a. Hardware

b. Data

c. Algorithms

d. A New Wave of Investment

e. The Democratization of Deep Learning

f. Will it Last?

• This chapter covers

• High-level definitions of fundamental concepts

• Timeline of the development of machine learning

• Key factors behind deep learning’s rising popularity and future potential

Artificial Intelligence

• Concise definition: the effort to automate intellectual tasks normally performed by humans

• Initial take: expert rules

• Fine for chess

• Difficult to develop rules for image classification, speech recognition, or language translation

Artificial Intelligence

Relationships Between AI, ML, and DL

• Expert Rules

• Linear Regression

• Logistic Regression

• Random Forests

• Gradient Boosting

• Multi-Layer Perceptron Network

• Convolutional Neural Networks

• Recurrent Neural Networks

Artificial Intelligence

Expert Rules Example

% Data: fruit(X) :- attributes(Y)

fruit(banana) :- colour(yellow), shape(crescent).

fruit(apple) :- (colour(green); colour(red)), shape(sphere), stem(yes).

fruit(lemon) :- colour(yellow), (shape(sphere);shape('tapered sphere')), acidic(yes).

fruit(lime) :- colour(green), shape(sphere), acidic(yes).

fruit(pear) :- colour(green), shape('tapered sphere').

fruit(plum) :- colour(purple), shape(sphere), stone(yes).

fruit(grape) :- (colour(purple);colour(green)), shape(sphere).

fruit(orange) :- colour(orange), shape(sphere).

fruit(satsuma) :- colour(orange), shape('flat sphere').

fruit(peach) :- colour(peach).

fruit(rhubarb) :- (colour(red); colour(green)), shape(stick).

fruit(cherry) :- colour(red), shape(sphere), stem(yes), stone(yes).

What is the value for colour?

[red, orange, yellow, green, purple, peach]

green

What is the value for shape?

[sphere, crescent, tapered sphere, flat sphere, stick]

stick

The fruit is rhubarb

http://www.paulbrownmagic.com/blog/simple_prolog_expert

Artificial Intelligence

Machine Learning

• Ada Lovelace, 1843: “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.”

• Alan Turing, 1950: quoting Ada Lovelace, while pondering whether general-purpose computers could be capable of learning

trained versus programmed

Artificial Intelligence

Learning Representations from Data

• Need three things (for supervised learning)

• Input data points: structured data, image files, sound files, text documents

• Examples of expected output

• Way to measure whether the algorithm is doing a good job

• Input representation examples

• Image as Red, Green, and Blue picture element (pixel) values

• Image as Hue, Saturation, and Value

https://en.wikipedia.org/wiki/HSL_and_HSV

Artificial Intelligence

Example Data

• The inputs are the coordinates of our points

• The expected outputs are the colors of the points

• A way to measure whether our algorithm is doing a good job could be the percentage of points that are being classified correctly (accuracy)

Artificial Intelligence

New Representation

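import numpy as np

# Hypothetical sample points, one (x, y) pair per row; the slide's Input array is not shown

Input = np.array([[ 1.0, 3.0 ], [ 2.0, 2.0 ], [ 3.0, 1.0 ]])

translation = np.array([ -2, -2 ])

theta = 0.25 * np.pi

Rotation = np.array([[ np.cos(theta), - np.sin(theta) ], [ np.sin(theta), np.cos(theta) ]])

# Shift every point, then multiply each row by the rotation matrix

np.dot(Input + translation, Rotation)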

Artificial Intelligence

Deep Neural Network for Digit Classification

• Successive layers of increasingly meaningful representations

• Alternative names for deep learning

• Layered representations learning

• Hierarchical representations learning

Artificial Intelligence

Deep Representations Learned by a Digit-Classification Model

Artificial Intelligence

How Deep Learning Works: Part 1 of 3

Neural Network Parameterized by its Weights

Artificial Intelligence

How Deep Learning Works: Part 2 of 3

Loss Function Measures Quality of Network’s Output

Artificial Intelligence

How Deep Learning Works: Part 3 of 3

Loss Score Used as Feedback Signal to Adjust Weights

Artificial Intelligence

Deep Learning Achievements

• Near-human-level image classification

• Near-human-level speech recognition

• Near-human-level handwriting transcription

• Improved machine translation

• Improved text-to-speech conversion

• Digital assistants such as Google Now and Amazon Alexa

• Near-human-level autonomous driving

• Improved ad targeting, as used by Google, Baidu, and Bing

• Improved search results on the web

• Ability to answer natural-language questions

• Superhuman Go playing

Artificial Intelligence

Deep Learning Hype

• Although some world-changing applications like autonomous cars are already within reach, many more are likely to remain elusive for a long time, such as believable dialogue systems, human-level machine translation across arbitrary languages, and human-level natural language understanding

• Previous AI “Winters”:

1. XOR (eXclusive Or): my perception is that the inability of a single perceptron to solve this problem cast a shadow on AI (though this limitation was understood at the time)

2. By the early 90s, rule-based systems had proven expensive to maintain, difficult to scale, and limited in scope

• We are currently in the intense optimism phase of a new cycle

Artificial Intelligence

Promise of AI

• Most of the research findings of deep learning aren’t yet applied to the full range of problems they can solve across industries

• “Your doctor doesn’t use AI, and neither does your accountant” [I thought I uploaded an image of a document during tax time]

• “Back in 1995, it would have been difficult to believe in the future impact of the internet”

• “In a not-so-distant future, AI will be your assistant; it will answer your questions, help educate your kids, and watch over your health. It will deliver your groceries to your door and drive you from point A to point B.”

• Don’t believe the short-term hype, but do believe in the long-term vision

Artificial Intelligence

History: Probabilistic Modeling

• Naïve Bayes

• Logistic Regression

Machine Learning

Early Neural Networks

• 1950s: Perceptron

• 1980s: Backpropagation

• Late 1980s: Yann LeCun’s work on MNIST

Machine Learning

History: Kernel Method

• Example kernel method for classification

Machine Learning

History: Decision Tree Ensembles

• Parameters that are learned are questions about the data: “Is feature2 in the data greater than 3.5?”

Machine Learning

History: Back to Neural Networks

• AlexNet was not the first fast GPU-implementation of a CNN to win an image recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU.[6] A deep CNN of Dan Ciresan et al. (2011) at IDSIA was already 60 times faster[7] and achieved superhuman performance in August 2011.[8] Between May 15, 2011 and September 10, 2012, their CNN won no less than four image competitions.[9][10] They also significantly improved on the best performance in the literature for multiple image databases.[11]

• According to the AlexNet paper,[5] Ciresan's earlier net is "somewhat similar." Both were originally written with CUDA to run with GPU support. In fact, both are actually just variants of the CNN designs introduced by Yann LeCun et al. (1989)

https://en.wikipedia.org/wiki/AlexNet

Machine Learning

Reasons for Deep Learning Success

• Incremental layer-by-layer way in which increasingly complex representations are developed

• Fact that these intermediate incremental representations are learned jointly

Machine Learning

Why Deep Learning? Why Now?

• Three technical forces driving advances in machine learning

• Hardware

• Datasets and benchmarks

• Algorithmic advances

• Following a scientific revolution, progress generally follows a sigmoid curve: it starts with a period of fast progress, which generally stabilizes as researchers hit hard limitations, and then, further improvements become incremental

Deep Learning

Models to Try

• Deep Learning should be viewed as another tool in the toolbox

• There are many possible machine learning methods to apply

• Suggestion is to try …

• Linear models

• Tree-based ensembles; e.g. random forests and gradient boosting

• Deep learning; e.g. feedforward, convolutional, and recurrent networks

• Gradient boosting and deep learning have won a lot of Kaggle competitions

Deep Learning

[DLP] Chapter 2: Before We Begin, the Mathematical Building Blocks of Neural Networks

1. A first look at a neural network

2. Data representations for neural networks

3. The gears of neural networks: tensor operations

4. The engine of neural networks: gradient-based optimization

5. Looking back at our first example

6. Chapter summary

MNIST Sample Digits

• The Modified (Segmented) National Institute of Standards and Technology (MNIST) data set is part of the history of Deep Learning

• Yann LeCun (Bell Labs at the time) used this data to learn convolution filters in 1989

http://yann.lecun.com/exdb/mnist/

First Look

Loading the MNIST Data

First Look

Network Architecture and Compilation

First Look

Preprocessing the Data

First Look

Network Training and Evaluation

First Look

Scalars (0D Tensors)

Another name for a number

Data Representations

shape = ()

Vectors (1D Tensors)

Another name for a one-dimensional array of numbers

Data Representations

shape = (5,)

Matrices (2D Tensors)

Another name for a two-dimensional array of numbers

Data Representations

shape = (3, 5)

3D Tensors and Higher-Dimensional Tensors

Packing 2D tensors into an array creates a 3D tensor

Data Representations

shape = (3, 3, 5)

Key Attributes of a Tensor

• Number of axes: sometimes called the rank; sometimes called the number of dimensions [the number of indices for specifying a cell]

• Shape: a list consisting of sizes for the axes of the tensor

• Data type

• int32: typically used for word indices and class indices

• uint8: typically used for pixel values

• float32: typically used for numeric features
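The three attributes above can be inspected directly in NumPy. This is a minimal sketch; the array names and values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Hypothetical example tensors; the values are arbitrary.
scalar = np.array(12)                        # 0D tensor
vector = np.array([12, 3, 6, 14, 7])         # 1D tensor
matrix = np.zeros((3, 5), dtype="float32")   # 2D tensor
cube = np.zeros((3, 3, 5), dtype="uint8")    # 3D tensor

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # ndim is the number of axes; shape lists the size of each axis
    print(name, t.ndim, t.shape, t.dtype)
```

Note that the number of axes always equals the length of the shape.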

Data Representations

Displaying a Digit

plt.imshow(Image.open("filename.png")) works too

Data Representations

Manipulating Tensors in Numpy

Data Representations

The Notion of Data Batches

The first axis is considered to be the batch axis
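Slicing along the batch axis can be sketched as follows. The array below is a small stand-in (1,000 blank 28x28 images rather than real MNIST data), and get_batch is a hypothetical helper name, not a library function.

```python
import numpy as np

# Stand-in for a data set: 1,000 blank 28x28 "images".
images = np.zeros((1000, 28, 28), dtype="uint8")
batch_size = 128

def get_batch(data, n, size):
    """Return the n-th batch by slicing along axis 0, the batch axis."""
    return data[size * n : size * (n + 1)]

first = get_batch(images, 0, batch_size)
last = get_batch(images, 7, batch_size)   # final batch, only partly filled
print(first.shape, last.shape)
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.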

Data Representations

Real-World Examples of Data Tensors

• Vector data: 2D tensors of shape (samples, features)

• Timeseries data or sequence data: 3D tensors of shape (samples, timesteps, features)

• Images: 4D tensor of shape (samples, height, width, channels) or (samples, channels, height, width)

• Video: 5D tensor of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)

Data Representations

Examples of Vector Data

Data Representations

Examples of Timeseries Data or Sequence Data

Mel Frequency Cepstral Coefficient (MFCC) representation of audio clips fits here as well

Data Representations

Image Data

• TensorFlow uses the “channels last” format, while the no-longer-maintained Theano used the “channels first” format

Data Representations

Video Data

Data Representations

The Gears of Neural Networks: Tensor Operations

Tensor Operations

Element-wise Operations

Tensor Operations

Broadcasting

• Broadcasting is used to add bias values to the product of an input matrix and a weight matrix

(samples, features) x (features, neurons) + (neurons)

= (samples, neurons) + neurons

• Broadcasting consists of two steps

Tensor Operations

Example: Adding a Vector to a Matrix
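A minimal NumPy sketch of the bias addition described above. The sizes and random values are assumptions for illustration. The explicit version spells out one conceptual reading of broadcasting (add an axis, then repeat along it); it is not a description of NumPy's internals.

```python
import numpy as np

samples, features, neurons = 4, 3, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(samples, features))   # input matrix
w = rng.normal(size=(features, neurons))   # weight matrix
b = np.array([0.5, -0.5])                  # one bias per neuron, shape (neurons,)

# Implicit broadcasting: (samples, neurons) + (neurons,)
out = np.dot(x, w) + b

# The same result with the two conceptual steps written out
b_2d = b[np.newaxis, :]                      # step 1: add an axis -> (1, neurons)
b_tiled = np.repeat(b_2d, samples, axis=0)   # step 2: repeat -> (samples, neurons)
out_explicit = np.dot(x, w) + b_tiled
print(np.allclose(out, out_explicit))
```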

Tensor Operations

Tensor Dot: Part 1 of 3

Tensor Operations

Tensor Dot: Part 2 of 3

Tensor Operations

Tensor Dot: Part 3 of 3
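As a rough illustration of the matrix case, the dot product can be written with explicit loops and checked against np.dot. The function name naive_matrix_dot and the sample values are assumptions for this sketch.

```python
import numpy as np

def naive_matrix_dot(x, y):
    """Matrix product with explicit loops: (a, b) . (b, c) -> (a, c)."""
    assert x.ndim == 2 and y.ndim == 2
    assert x.shape[1] == y.shape[0], "inner dimensions must match"
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            # entry (i, j) is the dot product of row i of x and column j of y
            z[i, j] = np.sum(x[i, :] * y[:, j])
    return z

x = np.arange(6, dtype="float64").reshape(2, 3)
y = np.arange(12, dtype="float64").reshape(3, 4)
z = naive_matrix_dot(x, y)
print(z.shape)
```

In practice np.dot is far faster; the loops are only there to show which entries are combined.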

Tensor Operations

Tensor Reshaping: Part 1 of 2

Tensor Operations

Tensor Reshaping: Part 2 of 2
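A small sketch of reshaping and transposition in NumPy; the values are arbitrary. Reshaping keeps the elements in the same order and only changes how they are grouped, whereas transposing swaps rows and columns.

```python
import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])          # shape (3, 2)

flat = x.reshape((6, 1))          # same 6 values, new shape (6, 1)
wide = x.reshape((2, 3))          # rows are refilled in order
transposed = np.transpose(x)      # rows and columns swap
print(wide)
print(transposed)
```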

Tensor Operations

Geometric Interpretation of Tensor Operations

This example illustrates something similar to what happens during a weight update operation based on momentum

Tensor Operations

A Geometric Interpretation of Deep Learning

Uncrumpling a complicated manifold of data (think about the XOR problem)

Tensor Operations

What’s a Derivative?

• A derivative quantifies the change in a function’s value as an input changes

• For a function of several inputs, the gradient points along the path of steepest ascent, so gradient descent steps in the opposite direction
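A minimal numeric sketch of these two points, assuming a hypothetical function f(x) = x² and a central-difference estimate of its slope.

```python
def f(x):
    return x ** 2          # hypothetical function to differentiate

def numerical_derivative(func, x, eps=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

slope = numerical_derivative(f, 3.0)   # the exact derivative 2x gives 6 here
step = 0.1
x_up = 3.0 + step * slope      # moving with the slope increases f
x_down = 3.0 - step * slope    # moving against it decreases f
print(slope, f(x_up), f(x_down))
```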

Gradient-Based Optimization

Stochastic Gradient Descent: Part 1 of 3

Gradient-Based Optimization

Stochastic Gradient Descent: Part 2 of 3

Gradient-Based Optimization

Stochastic Gradient Descent: Part 3 of 3

Momentum can help us avoid local minima
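The momentum update rule can be sketched on a one-dimensional toy loss. The loss w², the learning rate, and the momentum value are assumptions for illustration. This loss has a single minimum, so the sketch shows only the update rule, not an escape from a local minimum.

```python
def gradient(w):
    return 2.0 * w         # gradient of the hypothetical loss w**2

w = 5.0                    # starting weight
velocity = 0.0
learning_rate = 0.1
momentum = 0.9

for _ in range(300):
    # the velocity keeps a decaying memory of past gradients
    velocity = momentum * velocity - learning_rate * gradient(w)
    w = w + velocity

print(w)
```

Because the velocity carries over between steps, the weight overshoots and oscillates around the minimum before settling, which is the behavior that can carry it through shallow dips.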

Gradient-Based Optimization

Chaining Derivatives: the Backpropagation Algorithm

We take derivatives for the operations used as part of forward propagation (inference) to update weights

Book says:

Alternative version of chain rule:

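$$\frac{\partial\,\mathit{loss}}{\partial\,\mathit{weight}_{-1}} = \frac{\partial\,\mathit{loss}}{\partial\,\mathit{activation}_{-1}} \cdot \frac{\partial\,\mathit{activation}_{-1}}{\partial\,\mathit{product}_{-1}} \cdot \frac{\partial\,\mathit{product}_{-1}}{\partial\,\mathit{weight}_{-1}}$$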
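The three factors of this chain rule can be checked on a single unit. This is a sketch under stated assumptions: one weight, a sigmoid activation, a squared-error loss, and made-up values for the weight, input, and target.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weight, x, y):
    product = weight * x                 # last layer's weighted input
    activation = sigmoid(product)        # last layer's output
    loss = (activation - y) ** 2         # squared error, chosen for simplicity
    return product, activation, loss

weight, x, y = 0.5, 2.0, 1.0             # hypothetical values
product, activation, loss = forward(weight, x, y)

d_loss_d_activation = 2.0 * (activation - y)
d_activation_d_product = activation * (1.0 - activation)   # sigmoid derivative
d_product_d_weight = x
d_loss_d_weight = (d_loss_d_activation
                   * d_activation_d_product
                   * d_product_d_weight)

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (forward(weight + eps, x, y)[2]
           - forward(weight - eps, x, y)[2]) / (2 * eps)
print(d_loss_d_weight, numeric)
```

The derivative is negative here, so increasing the weight would lower the loss.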

Gradient-Based Optimization

Network Review

Looking Back

Chapter Summary

Activation Functions

[DLP] Chapter 3: Getting Started with Neural Networks

1. Anatomy of a Neural Network

2. Introduction to Keras

3. Setting Up a Deep Learning Workstation

4. Classifying Movie Reviews: a Binary Classification Example

5. Classifying Newswires: a MultiClass Classification Example

6. Predicting House Prices: a Regression Example

7. Chapter Summary

Neural Networks

Training involves …

Anatomy of a Neural Network

Relationship between the Network, Layers, Loss Function, and Optimizer

Anatomy of a Neural Network

Layers as the Lego Bricks of Deep Learning ☺

The first layer requires an input_shape parameter [the dimensions of a single observation], while additional layers do not require this parameter

Anatomy of a Neural Network

Networks of Layers

• A deep learning model is a directed, acyclic graph of layers

• Most common instance is a “linear” (simple, sequential) stack of layers

• Other common instances include …

• Two-branch networks; e.g. question goes down one branch and text passage goes down another (or maybe multi-modal input, for example an image and a text description)

• Multi-head networks; e.g. we have one output predict whether a news article discusses “politics” and another output predict whether a news article discusses “health”

• Inception blocks; e.g. we want to use a few different convolution (input filtering) approaches in parallel

Anatomy of a Neural Network

Loss Function and Optimizers

• Loss functions for this class include: cross entropy, mean squared error, mean absolute error, content loss, style loss, total variation loss, Kullback Leibler loss, temporal difference loss, actor loss, and critic loss

• Optimization functions for the class include: Stochastic Gradient Descent (SGD), Root Mean Squared (Gradient) Propagation (RMSProp), and Adaptive Moments (AdaM: RMSProp + Momentum)

Anatomy of a Neural Network

Keras Features

Introduction to Keras

Google Search Interest for Deep Learning Frameworks

Introduction to Keras

Deep Learning Software and Hardware Stack

• Nvidia Graphics Processing Units (GPUs) and Google Tensor Processing Units (TPUs) support efficient deep learning

• Nvidia’s Compute Unified Device Architecture (CUDA) Application Programming Interface (API) and the CUDA Deep Neural Network (cuDNN) library provide an interface to Nvidia GPUs

• Eigen library implements the Basic Linear Algebra Subprograms (BLAS) specification, allowing tensor manipulation on Central Processing Units (CPUs)

Theano and CNTK are no longer maintained

Introduction to Keras

Typical Keras Workflow

Introduction to Keras

Network Definition: Sequential Model versus the Functional API

Same model with both methods …

Introduction to Keras

Model Configuration and Training

Introduction to Keras

Two Options for Getting Keras Running

Setting Up a Deep Learning Workstation

Loading the Internet Movie DataBase (IMDB) Sentiment Analysis Data

num_words is the size of the vocabulary

Classifying Movie Reviews

Decoding a Document

Classifying Movie Reviews

Turning Lists of Integers into Tensors

Note: if more than one value in a row is one, we should refer to this as a multi-hot encoding

• one-hot encoding for identifying a class in a dense target vector

• multi-hot encoding to identify the tokens present in a document in a dense input vector

Classifying Movie Reviews

Encoding the Integer Sequences into a Binary Matrix
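A minimal sketch of multi-hot encoding in NumPy. The two sample "documents" and the 10-word vocabulary are made up for illustration. A repeated word index still sets its position to 1 only once.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of word indices into a (len, dimension) matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0     # set every index present in the document
    return results

# Two hypothetical "documents" over a 10-word vocabulary
docs = [[1, 3, 3, 7], [0, 9]]
encoded = vectorize_sequences(docs, dimension=10)
print(encoded)
```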

Classifying Movie Reviews

Architecture Decisions for Simple Feedforward Network

• How many layers to use

• How many hidden units to choose for each layer

• Which activation functions to use

• Do *not* forget to include activation functions: unexplained suboptimality will ensue

Classifying Movie Reviews

Common Activation Functions

Rectified Linear Unit (ReLU): max(x,0)

[no saturation issue]

Sigmoid: 1/(1+exp(-x))

[usually used for output layer]
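The two formulas above translate directly into NumPy. This is a small sketch with arbitrary inputs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # negative inputs are clipped to zero
print(sigmoid(x))    # outputs lie strictly between 0 and 1
```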

Classifying Movie Reviews

IMDB Network Architecture

• Features flow from bottom to top

• An output is called a “head”

• Two hidden layers and an output layer with weights

• It’s a deep neural network

• We’ll get to more than one hundred layers soon enough

Classifying Movie Reviews

Model Definition

Classifying Movie Reviews

Parameters and Outputs for a Dense Layer

• Parameters

• (Number of Inputs from Previous Layer + 1) * (Number of “Units”)

• + 1 for bias weights: one for each “unit”

• We used to refer to “units” as neurons

• The names have been changed to protect the innocent? Our approach was inspired by neuroscience, but our brains aren’t using RMSProp ☺

• These are the same weight vectors we’ve come to know and love: projecting inputs to a new representation, one feature at a time [the number of “units” is the number of new features for the new representation]

• Output Shape

• (Batch Size) x (Number of “Units”)
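The parameter formula can be checked with a few lines of arithmetic. The layer sizes below (10,000 multi-hot inputs, two hidden layers of 16 units, one output unit) are an assumption about the IMDB network, not taken from this slide.

```python
def dense_param_count(n_inputs, n_units):
    """(inputs + 1 bias) weights for each unit."""
    return (n_inputs + 1) * n_units

# Assumed layer sizes: 10,000 inputs -> 16 -> 16 -> 1
sizes = [10000, 16, 16, 1]
counts = [dense_param_count(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
print(counts, sum(counts))
```

Almost all of the parameters sit in the first layer, because it connects every vocabulary entry to every hidden unit.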

Classifying Movie Reviews

Why Are Activation Functions Necessary?

Try omitting activation functions from 1) the output layer and 2) hidden layers … so you can recognize this issue later
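One way to see the hidden-layer issue: without activation functions, two stacked Dense layers compute exactly what a single Dense layer could. The sketch below uses random weights and made-up sizes to illustrate this.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
w2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

# Two Dense layers with no activation function ...
two_layers = np.dot(np.dot(x, w1) + b1, w2) + b2

# ... equal one Dense layer with combined weights and bias
w, b = np.dot(w1, w2), np.dot(b1, w2) + b2
one_layer = np.dot(x, w) + b
print(np.allclose(two_layers, one_layer))
```

Adding depth without a nonlinearity therefore adds parameters but no new representational power.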

Classifying Movie Reviews

Compiling the Model

Classifying Movie Reviews

Setting Aside a Validation Set

Classifying Movie Reviews

Training the Model

Classifying Movie Reviews

Plotting the Training and Validation Loss

Classifying Movie Reviews

Where Do We Start Overfitting?

Classifying Movie Reviews

Plotting the Training and Validation Accuracy

Classifying Movie Reviews

Where Do We Start Overfitting?

Classifying Movie Reviews

Retraining the Model from Scratch

Why are we “retraining the model from scratch”?

Classifying Movie Reviews

Generating Predictions on New Data

Classifying Movie Reviews

Ideas for Experiments

Classifying Movie Reviews

Wrapping Up the IMDB Example

Classifying Movie Reviews

Loading the Reuters Dataset

Classifying Newswires

Decoding Newswires Back to Text

Classifying Newswires

Preparing the Document Matrices

Classifying Newswires

Preparing the Target Matrices

Classifying Newswires

Defining the Model

Classifying Newswires

Notes About the Architecture

Classifying Newswires

Validating the Approach

Classifying Newswires

Where Do We Start Overfitting?

Classifying Newswires

Where Do We Start Overfitting?

Classifying Newswires

Retraining a Model from Scratch

Classifying Newswires

Why are we “retraining a model from scratch”?

Comparing to Random [and a Majority Classifier]

Nota bene (note well): 813 of the 2,246 test examples belonged to class 3

The accuracy of a majority classifier is 36.2%

Classifying Newswires

Generating Predictions for New Data

Classifying Newswires

Dense Versus Sparse Labels

Classifying Newswires

Model With an Information Bottleneck

71% accuracy: an 8% absolute drop

Classifying Newswires

Further Experiments

Classifying Newswires

Wrapping Up

Classifying Newswires

Loading the Boston Housing Dataset

1970s home prices in thousands of dollars

Predicting House Prices

Brief Discussion of Bias

• Boston Housing dataset has been used by many popular textbooks

• The data explicitly offers a race-related variable for modeling

• Avoid using proxy variables that lead to discrimination based on race, gender, religion, etc.

• Example: don’t ask about gender if all you really want to know is whether the candidate can lift X pounds

Predicting House Prices

http://lib.stat.cmu.edu/datasets/boston

Normalizing the Data

Predicting House Prices

Model Definition

Predicting House Prices

3-Fold Cross-Validation

Predicting House Prices

4-Fold Cross-Validation Implementation

Predicting House Prices

Cross-Validation Loop

Predicting House Prices

Cross-Validation Results

Predicting House Prices

Alternative Implementation [saved history]

Predicting House Prices

Plotting the Average Mean Absolute Error (MAE)

Predicting House Prices

Visualization Suggestions

Predicting House Prices

Smoothing the Curve

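The smoothed_points expression should look familiar: it is the same exponential moving average used by RMSProp (decay of 0.9 for the running average of squared gradients) and AdaM (0.9 for the running average of gradients and 0.999 for the running average of squared gradients)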
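A sketch of the smoothing step as an exponential moving average, modeled on the textbook's approach. The factor of 0.9 and the sample values are assumptions for illustration.

```python
def smooth_curve(points, factor=0.9):
    """Exponential moving average: blend each point with the previous smoothed value."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)   # first point has nothing to blend with
    return smoothed_points

noisy = [3.0, 2.0, 4.0, 2.0, 3.0]    # hypothetical per-epoch MAE values
smoothed = smooth_curve(noisy)
print(smoothed)
```

A larger factor gives a smoother curve that reacts more slowly to new points.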

Predicting House Prices

Plotting the Smoothed MAE

Predicting House Prices

Training the Final Model

Predicting House Prices

Wrapping Up

Predicting House Prices

Chapter Summary

[DLP] Chapter 4: Fundamentals of Machine Learning

1. Four Branches of Machine Learning

2. Evaluating Machine Learning Models

3. Data Preprocessing, Feature Engineering, and Feature Learning

4. Overfitting and Underfitting

5. The Universal Workflow of Machine Learning

Supervised Learning Examples

Four Branches of Machine Learning

Unsupervised Learning

• Dimensionality Reduction

• Clustering

Four Branches of Machine Learning

Self-Supervised Learning

• Learning Without Human Annotated Labels

• Autoencoders

• Trying to predict the next word given previous words

• Trying to predict the next frame given previous frames

Four Branches of Machine Learning

Reinforcement Learning

• Google DeepMind used reinforcement learning to create a model to play Atari games

• AlphaGo was created to play Go

• Occasional rewards

• Examples of possible applications include: self-driving cars, robotics, resource management, and education

Four Branches of Machine Learning

From Yann LeCun …

https://t.co/2LSb622114

Four Branches of Machine Learning

Also from Yann LeCun …

https://t.co/2LSb622114

Four Branches of Machine Learning

Classification and Regression Glossary

Four Branches of Machine Learning

Classification and Regression Glossary

Four Branches of Machine Learning

Simple Hold-out Validation Split

Evaluating Machine Learning Models

Hold-out Validation Implementation [note the concatenation]

Evaluating Machine Learning Models

K-Fold Cross-Validation

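Used for smaller data sets

• If K is too small, each model trains on a smaller share of the data, so the performance estimate is biased toward the pessimistic side

• If K is too large, the validation folds are small, so the per-fold scores have high variance and more models must be trained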

Evaluating Machine Learning Models

K-Fold Cross Validation Implementation
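The fold bookkeeping can be sketched with NumPy index arrays. The helper name k_fold_indices and the 10-sample data set are hypothetical, and model training is left out.

```python
import numpy as np

def k_fold_indices(num_samples, k):
    """Yield (train_idx, val_idx) pairs; fold i is held out for validation."""
    indices = np.arange(num_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical data set of 10 samples split into 4 folds
splits = list(k_fold_indices(10, k=4))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx)
```

Each sample lands in exactly one validation fold, and the final score is the average of the K validation scores.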

Evaluating Machine Learning Models

Iterated K-Fold Cross-Validation with Shuffling

history = []

for i in range(iterationCount):

    shuffle(data)

    history.append(crossValidation(data, K = k))

• Requires building iterationCount * K + 1 models

Evaluating Machine Learning Models

Things to Keep in Mind

Evaluating Machine Learning Models

Value Normalization

• Dividing by 255 was an example of min-max normalization:

• value = (value - min(value)) / (max(value) - min(value))

• The max pixel value was 255 and the min pixel value was 0

• Alternatively, you can use center-and-scale normalization:

[-1,1] is fine too

Consider removing outliers
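Both normalizations can be sketched in NumPy. The sample values are made up, and the center-and-scale form used here (subtract the mean, divide by the standard deviation, per feature) is an assumption about what the slide's missing formula showed.

```python
import numpy as np

pixels = np.array([0.0, 51.0, 102.0, 255.0])    # hypothetical pixel values

# Min-max normalization to [0, 1]
min_max = (pixels - pixels.min()) / (pixels.max() - pixels.min())

# Center and scale: subtract the mean, divide by the standard deviation
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])             # hypothetical (samples, features)
mean = features.mean(axis=0)
std = features.std(axis=0)
standardized = (features - mean) / std
print(min_max, standardized.mean(axis=0), standardized.std(axis=0))
```

In practice the mean and standard deviation would be computed on the training data only and then reused for validation and test data.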

Data Preprocessing, Feature Engineering, and Feature Learning

Missing Values

• “In general, with neural networks, it’s safe to input missing values as 0, with the condition that zero isn’t a meaningful value”

• It’s possible to add indicator variables: 1 if missing; 0 otherwise

• If you expect missing values at test time, be sure to train with missing values:

• We train like we deploy and deploy like we train
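A minimal sketch of zero-filling plus an indicator variable, assuming missing entries are marked as NaN in a made-up feature column.

```python
import numpy as np

# Hypothetical feature column with missing entries marked as NaN
column = np.array([1.5, np.nan, 0.7, np.nan])

missing = np.isnan(column).astype("float32")       # 1 if missing, 0 otherwise
filled = np.where(np.isnan(column), 0.0, column)   # input missing values as 0
features = np.stack([filled, missing], axis=1)     # shape (samples, 2)
print(features)
```

The indicator column lets the network tell a filled-in zero apart from a genuine zero.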

Data Preprocessing, Feature Engineering, and Feature Learning

Feature Engineering Example

Three different inputs for the “What time is it?” model …

Why no radius on the polar coordinates?

Data Preprocessing, Feature Engineering, and Feature Learning

Feature Engineering

• Does this mean you don’t have to worry about feature engineering as long as you’re using deep neural networks?

• No …

Data Preprocessing, Feature Engineering, and Feature Learning

Original versus Lower Capacity Model

Original Model: 16 units for each hidden layer

Lower Capacity Model: 4 units for each hidden layer

Overfitting and Underfitting

Original versus Lower Capacity Model

Smaller network starts overfitting later and its performance degrades more slowly

Overfitting and Underfitting

Original versus Higher Capacity Model: Validation Data

Validation Loss Noisier for Higher Capacity Model (512 versus 16 units for each hidden layer)

Overfitting and Underfitting

Original versus Higher Capacity Model: Training Data

More capacity gives a model the ability to more quickly model the training data, but it also makes it susceptible to overfitting

Overfitting and Underfitting

Regularization [for Smaller Weights]

Overfitting and Underfitting

Example for Effect of Weight Regularization

• Note: the goal of weight regularization is to improve generalization performance ☺

• Use “metric” rather than “loss” for comparing generalization performance; e.g. regularized crossentropy can be used for the loss function with crossentropy used for the evaluation metric [this allows us to use the same evaluation function when comparing performance of models on validation data]
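The difference between the metric and the regularized loss can be sketched in NumPy. The predictions, weights, and L2 coefficient below are made-up values, and the helper names are hypothetical.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred):
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

def l2_penalty(weights, coefficient):
    return coefficient * float(np.sum(weights ** 2))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
weights = np.array([0.5, -1.0, 2.0])

metric = binary_crossentropy(y_true, y_pred)             # comparable across models
loss = metric + l2_penalty(weights, coefficient=0.001)   # what training minimizes
print(metric, loss)
```

The penalty depends on the weights and the coefficient, so two models with different regularization settings are only comparable on the plain crossentropy metric.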

Overfitting and Underfitting

Additional Weight Regularizers for Keras

Overfitting and Underfitting

Adding Dropout(dropoutRate)
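What a dropout layer does at training time can be sketched in NumPy. This is an illustration of the "inverted dropout" idea under assumed sizes and rate, not the Keras implementation.

```python
import numpy as np

def dropout(layer_output, rate, rng):
    """Inverted dropout: zero a fraction `rate` of values, rescale the rest."""
    keep_mask = rng.random(layer_output.shape) >= rate
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 16))            # hypothetical layer output
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped.mean())
```

Rescaling by 1 / (1 - rate) keeps the expected activation unchanged, so nothing needs to be adjusted at test time, when no units are dropped.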

Overfitting and Underfitting

Adding Dropout to the IMDB Network

Overfitting and Underfitting

Recap of Most Common Ways to Prevent Overfitting

Overfitting and Underfitting

Define the Problem

Universal Workflow of Machine Learning

Hypothesis

Universal Workflow of Machine Learning

Choosing a Measure of Success

• Accuracy

• Precision and Recall

• Area Under the Receiver Operating Characteristic (ROC) Curve (AUC)

• Maximize Recall subject to a constraint on the False Positive Rate?

• Mean Average Precision

Universal Workflow of Machine Learning

Deciding on an Evaluation Protocol

Universal Workflow of Machine Learning

Preparing Your Data

Universal Workflow of Machine Learning

Key Choices for Your First Iteration

Universal Workflow of Machine Learning

Choosing the Last-Layer Activation and Loss Function

Universal Workflow of Machine Learning

How Big Should the Model Be?

Developing a model that overfits …

Universal Workflow of Machine Learning

Regularizing the Model

Universal Workflow of Machine Learning

Tuning the Model

We call these hyperparameters to distinguish them from the parameters of the model; i.e. the weights.

Note: We tune against validation data. Much like the “private leaderboard”, we only get one look at test performance.

Universal Workflow of Machine Learning

Chapter Summary

plot_model() and tensorboard

# To use plot_model() to generate ".png" file:
# $ sudo apt install python-pydot
# $ pip install pydot
# $ pip install graphviz

# To review tensorboard output:
# Start the tensorboard server ...
# $ tensorboard --logdir=logs --bind_all
# Use browser to navigate to tensorboard server ...
# http://host:6006/
# Tensorboard reference: https://github.com/tensorflow/tensorboard

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import TensorBoard

# Load MNIST, flatten each 28x28 image to 784 values, and scale pixels to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

# Two hidden layers with dropout, followed by a 10-way softmax output
model = Sequential()
model.add(Dense(512, activation="relu", input_shape=(784,), name="hidden0"))
model.add(Dropout(0.2, name="dropout0"))
model.add(Dense(512, activation="relu", name="hidden1"))
model.add(Dropout(0.2, name="dropout1"))
model.add(Dense(10, activation="softmax", name="output"))
model.summary()
plot_model(model, to_file="model.png", show_shapes=True)

# Sparse labels (class indices), so use the sparse version of the loss
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

# Hold out 10% of the training data for validation; log to "logs" for TensorBoard
history = model.fit(x_train, y_train,
                    batch_size=128, epochs=20,
                    validation_split=0.1,
                    callbacks=[TensorBoard(log_dir="logs", histogram_freq=5)])

score = model.evaluate(x_test, y_test)
print(f"Test loss: {score[0]}\tTest accuracy: {score[1]}")

Bonus Material

TensorBoard Scalars

TensorBoard Graphs

TensorBoard Distributions

From top to bottom the lines represent: [maximum, 93%, 84%, 69%, 50%, 31%, 16%, 7%, minimum]

TensorBoard Histograms
